How to Implement the Boyer Moore Pattern Searching Algorithm: Detailed Explanation and Implementation

March 28, 2021, 13:41:28

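Overview

Pattern searching is an important problem in computer science. When we search for a string in a notepad or Word file, in a browser, or in a database, a pattern searching algorithm is used to produce the search results. A typical problem statement is:

Given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char pat[], char txt[]) that prints all occurrences of pat[] in txt[]. You may assume that n > m.

Examples: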

Input:  txt[] = "THIS IS A TEST TEXT"
        pat[] = "TEST"
Output: Pattern found at index 10

Input:  txt[] =  "AABAACAADAABAABA"
        pat[] =  "AABA"
Output: Pattern found at index 0
        Pattern found at index 9
        Pattern found at index 12

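In this post we discuss the Boyer Moore pattern searching algorithm. Like the KMP and finite automata algorithms, the Boyer Moore algorithm also preprocesses the pattern.

The Boyer Moore algorithm is a combination of the following two approaches.

1) Bad character heuristic

2) Good suffix heuristic

Either heuristic can also be used on its own to search for a pattern in a text. Let us first understand how the two independent approaches work together in the Boyer Moore algorithm. The naive algorithm slides the pattern over the text one position at a time. The KMP algorithm preprocesses the pattern so that it can be shifted by more than one position. The Boyer Moore algorithm preprocesses the pattern for the same reason: it processes the pattern and creates a separate array for each of the two heuristics. At every step, it slides the pattern by the larger of the two shifts suggested by the heuristics, so it uses the better heuristic at each step.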
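For comparison, the naive approach mentioned above can be sketched as follows. This is only a baseline illustration; the function name naive_search is a hypothetical helper chosen for this sketch and does not appear in the implementations later in this post.

```python
def naive_search(txt, pat):
    """Return every index where pat occurs in txt, trying each shift in turn."""
    n, m = len(txt), len(pat)
    matches = []
    # Slide the pattern over the text one position at a time.
    for s in range(n - m + 1):
        if txt[s:s + m] == pat:
            matches.append(s)
    return matches


# The two examples from the problem statement above.
print(naive_search("THIS IS A TEST TEXT", "TEST"))
print(naive_search("AABAACAADAABAABA", "AABA"))
```

Every shift is tried here, which is exactly the work that the preprocessing in KMP and Boyer Moore tries to avoid.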

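Unlike the pattern searching algorithms discussed previously, the Boyer Moore algorithm starts matching from the last character of the pattern.

In this post we discuss the bad character heuristic; the good suffix heuristic is discussed in the next post.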

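Bad Character Heuristic

The idea of the bad character heuristic is simple. The character of the text that does not match the current character of the pattern is called the bad character. On a mismatch, we shift the pattern until:

1) The mismatch becomes a match

2) Pattern P moves past the mismatched character.

Case 1: Mismatch becomes a match

We look up the position of the last occurrence of the mismatched character in the pattern. If the mismatched character exists in the pattern, we shift the pattern so that this occurrence is aligned with the mismatched character in the text T.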

[Figure: Case 1]

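Explanation: In the above example, we get a mismatch at position 3. Here the mismatched character is "A". We now search for the last occurrence of "A" in the pattern. We find "A" at position 1 in the pattern (shown in blue), and this is its last occurrence. We then shift the pattern by 2 positions so that the "A" in the pattern is aligned with the "A" in the text.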
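The shift in Case 1 is the mismatch index minus the index of the bad character's last occurrence in the pattern. A minimal sketch of that arithmetic follows; the helper name bad_char_shift and the sample text and pattern are assumptions chosen for illustration so that the numbers agree with the explanation (mismatch at position 3, "A" last seen at position 1, shift of 2).

```python
def bad_char_shift(pat, txt, s):
    """Compare pat with txt at shift s, scanning from right to left.

    Returns (j, shift): j is the index in pat where the mismatch happened
    (-1 if the whole pattern matched) and shift is the suggested slide.
    """
    # Last occurrence of every character in the pattern.
    last = {c: i for i, c in enumerate(pat)}
    j = len(pat) - 1
    while j >= 0 and pat[j] == txt[s + j]:
        j -= 1
    if j < 0:
        return -1, 0
    bad = txt[s + j]
    # Characters absent from the pattern count as occurring at index -1.
    return j, max(1, j - last.get(bad, -1))


# Hypothetical sample data: mismatch on "A" at index 3.
print(bad_char_shift("TATGTG", "GCAATGCCTATGTGACC", 0))
```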

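Case 2: Pattern moves past the mismatched character

We look up the position of the last occurrence of the mismatched character in the pattern. If the character does not exist in the pattern, we shift the pattern past the mismatched character.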

[Figure: Case 2]

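Explanation: Here we have a mismatch at position 7. The mismatched character "C" does not exist in the pattern before position 7, so we shift the pattern past position 7, and eventually, in the above example, we get a perfect match of the pattern (shown in green). We do this because "C" does not exist in the pattern, so every shift before position 7 would produce a mismatch and the search would be fruitless.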
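To see both cases in sequence, the sketch below records every shift the bad character rule visits until the first match. The function name trace_shifts and the sample strings are assumptions made for illustration; the data is chosen so that a character missing from the pattern is hit at text position 7.

```python
def trace_shifts(txt, pat):
    """Return the list of shifts tried before (and including) the first match."""
    last = {c: i for i, c in enumerate(pat)}
    n, m = len(txt), len(pat)
    s, visited = 0, []
    while s <= n - m:
        visited.append(s)
        j = m - 1
        while j >= 0 and pat[j] == txt[s + j]:
            j -= 1
        if j < 0:
            break  # full match at shift s
        # If txt[s + j] is not in pat, last.get gives -1 and the shift
        # becomes j + 1, which moves the pattern past the bad character.
        s += max(1, j - last.get(txt[s + j], -1))
    return visited


print(trace_shifts("GCAATGCCTATGTGACC", "TATGTG"))
```

With this sample data the second alignment mismatches on "C" at text position 7, and since "C" is absent from the pattern, the pattern jumps by its full length.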

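In the following implementation, we preprocess the pattern and store the last occurrence of every possible character in an array whose size equals the alphabet size. If the character does not occur in the pattern at all, the shift can be as large as m (the length of the pattern). Therefore, the bad character heuristic takes O(n/m) time in the best case.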
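As a small illustration of the preprocessing step, the sketch below builds the last-occurrence table for one pattern and looks up a few entries. The name build_last_occurrence is hypothetical, and the sketch assumes every character code is below 256, matching the alphabet size used in the implementations that follow.

```python
NO_OF_CHARS = 256  # assumed alphabet size (8-bit characters)


def build_last_occurrence(pat):
    """Return a table where table[code] is the last index of that character
    in pat, or -1 if the character does not occur in pat."""
    table = [-1] * NO_OF_CHARS
    for i, ch in enumerate(pat):
        table[ord(ch)] = i  # later occurrences overwrite earlier ones
    return table


table = build_last_occurrence("AABA")
# "A" last occurs at index 3, "B" at index 2, and "C" does not occur.
print(table[ord("A")], table[ord("B")], table[ord("C")])
```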


C++

/* C++ Program for Bad Character Heuristic of Boyer 
Moore String Matching Algorithm */
#include <bits/stdc++.h>
using namespace std;
# define NO_OF_CHARS 256 
  
// The preprocessing function for Boyer Moore's 
// bad character heuristic 
void badCharHeuristic( string str, int size, int badchar[NO_OF_CHARS]) 
{ 
     int i; 
  
     // Initialize all occurrences as -1 
     for (i = 0; i < NO_OF_CHARS; i++) 
         badchar[i] = -1; 
  
     // Fill the actual value of last occurrence 
     // of a character 
     for (i = 0; i < size; i++) 
         // Cast through unsigned char so the index is never negative
         badchar[(unsigned char) str[i]] = i; 
} 
  
/* A pattern searching function that uses Bad 
Character Heuristic of Boyer Moore Algorithm */
void search( string txt, string pat) 
{ 
     int m = pat.size(); 
     int n = txt.size(); 
  
     int badchar[NO_OF_CHARS]; 
  
     /* Fill the bad character array by calling 
     the preprocessing function badCharHeuristic() 
     for given pattern */
     badCharHeuristic(pat, m, badchar); 
  
     int s = 0; // s is shift of the pattern with 
                 // respect to text 
     while (s <= (n - m)) 
     { 
         int j = m - 1; 
  
         /* Keep reducing index j of pattern while 
         characters of pattern and text are 
         matching at this shift s */
         while (j >= 0 && pat[j] == txt[s + j]) 
             j--; 
  
         /* If the pattern is present at current 
         shift, then index j will become -1 after 
         the above loop */
         if (j < 0) 
         { 
             cout << "pattern occurs at shift = " <<  s << endl; 
  
             /* Shift the pattern so that the next 
             character in text aligns with the last 
             occurrence of it in pattern. 
             The condition s+m < n is necessary for 
             the case when pattern occurs at the end 
             of text */
             s += (s + m < n)? m - badchar[(unsigned char) txt[s + m]] : 1; 
  
         } 
  
         else
             /* Shift the pattern so that the bad character 
             in text aligns with the last occurrence of 
             it in pattern. The max function is used to 
             make sure that we get a positive shift. 
             We may get a negative shift if the last 
             occurrence of bad character in pattern 
             is on the right side of the current 
             character. */
             s += max(1, j - badchar[(unsigned char) txt[s + j]]); 
     } 
} 
  
/* Driver code */
int main() 
{ 
     string txt= "ABAAABCD" ; 
     string pat = "ABC" ; 
     search(txt, pat); 
     return 0; 
} 
   
  // This code is contributed by rathbhupendra

C

/* C Program for Bad Character Heuristic of Boyer 
    Moore String Matching Algorithm */
# include <limits.h>
# include <string.h>
# include <stdio.h>
  
# define NO_OF_CHARS 256
  
// A utility function to get maximum of two integers
int max ( int a, int b) { return (a > b)? a: b; }
  
// The preprocessing function for Boyer Moore's
// bad character heuristic
void badCharHeuristic( char *str, int size, int badchar[NO_OF_CHARS])
{
     int i;
  
     // Initialize all occurrences as -1
     for (i = 0; i < NO_OF_CHARS; i++)
          badchar[i] = -1;
  
     // Fill the actual value of last occurrence 
     // of a character
     for (i = 0; i < size; i++)
          /* Cast through unsigned char so the index is never negative */
          badchar[(unsigned char) str[i]] = i;
}
  
/* A pattern searching function that uses Bad
    Character Heuristic of Boyer Moore Algorithm */
void search( char *txt, char *pat)
{
     int m = strlen (pat);
     int n = strlen (txt);
  
     int badchar[NO_OF_CHARS];
  
     /* Fill the bad character array by calling 
        the preprocessing function badCharHeuristic() 
        for given pattern */
     badCharHeuristic(pat, m, badchar);
  
     int s = 0;  // s is shift of the pattern with 
                 // respect to text
     while (s <= (n - m))
     {
         int j = m-1;
  
         /* Keep reducing index j of pattern while 
            characters of pattern and text are 
            matching at this shift s */
         while (j >= 0 && pat[j] == txt[s+j])
             j--;
  
         /* If the pattern is present at current
            shift, then index j will become -1 after
            the above loop */
         if (j < 0)
         {
             printf ( "\n pattern occurs at shift = %d" , s);
  
             /* Shift the pattern so that the next 
                character in text aligns with the last 
                occurrence of it in pattern.
                The condition s+m < n is necessary for 
                the case when pattern occurs at the end 
                of text */
             s += (s+m < n)? m - badchar[(unsigned char) txt[s+m]] : 1;
  
         }
  
         else
             /* Shift the pattern so that the bad character
                in text aligns with the last occurrence of
                it in pattern. The max function is used to
                make sure that we get a positive shift. 
                We may get a negative shift if the last 
                occurrence  of bad character in pattern
                is on the right side of the current 
                character. */
             s += max(1, j - badchar[(unsigned char) txt[s+j]]);
     }
}
  
/* Driver program to test above function */
int main()
{
     char txt[] = "ABAAABCD" ;
     char pat[] = "ABC" ;
     search(txt, pat);
     return 0;
}

Java

/* Java Program for Bad Character Heuristic of Boyer 
Moore String Matching Algorithm */
  
  
class AWQ{
      
      static int NO_OF_CHARS = 256 ;
       
     //A utility function to get maximum of two integers
      static int max ( int a, int b) { return (a > b)? a: b; }
  
      //The preprocessing function for Boyer Moore's
      //bad character heuristic
      static void badCharHeuristic( char []str, int size, int badchar[])
      {
       int i;
  
       // Initialize all occurrences as -1
       for (i = 0 ; i < NO_OF_CHARS; i++)
            badchar[i] = - 1 ;
  
       // Fill the actual value of last occurrence 
       // of a character
       for (i = 0 ; i < size; i++)
            badchar[( int ) str[i]] = i;
      }
  
      /* A pattern searching function that uses Bad
      Character Heuristic of Boyer Moore Algorithm */
      static void search( char txt[], char pat[])
      {
       int m = pat.length;
       int n = txt.length;
  
       int badchar[] = new int [NO_OF_CHARS];
  
       /* Fill the bad character array by calling 
          the preprocessing function badCharHeuristic() 
          for given pattern */
       badCharHeuristic(pat, m, badchar);
  
       int s = 0 ;  // s is shift of the pattern with 
                   // respect to text
       while (s <= (n - m))
       {
           int j = m- 1 ;
  
           /* Keep reducing index j of pattern while 
              characters of pattern and text are 
              matching at this shift s */
           while (j >= 0 && pat[j] == txt[s+j])
               j--;
  
           /* If the pattern is present at current
              shift, then index j will become -1 after
              the above loop */
           if (j < 0 )
           {
               System.out.println( "Patterns occur at shift = " + s);
  
               /* Shift the pattern so that the next 
                  character in text aligns with the last 
                  occurrence of it in pattern.
                  The condition s+m < n is necessary for 
                  the case when pattern occurs at the end 
                  of text */
               s += (s+m < n)? m-badchar[txt[s+m]] : 1 ;
  
           }
  
           else
               /* Shift the pattern so that the bad character
                  in text aligns with the last occurrence of
                  it in pattern. The max function is used to
                  make sure that we get a positive shift. 
                  We may get a negative shift if the last 
                  occurrence  of bad character in pattern
                  is on the right side of the current 
                  character. */
               s += max( 1 , j - badchar[txt[s+j]]);
       }
      }
  
      /* Driver program to test above function */
     public static void main(String []args) {
          
          char txt[] = "ABAAABCD" .toCharArray();
          char pat[] = "ABC" .toCharArray();
          search(txt, pat);
     }
}

Python

# Python3 Program for Bad Character Heuristic
# of Boyer Moore String Matching Algorithm 
  
NO_OF_CHARS = 256
  
def badCharHeuristic(string, size):
     '''
     The preprocessing function for
     Boyer Moore's bad character heuristic
     '''
  
     # Initialize all occurrence as -1
     badChar = [ - 1 ] * NO_OF_CHARS
  
     # Fill the actual value of last occurrence
     for i in range (size):
         badChar[ord(string[i])] = i

     # return the initialized list
     return badChar
  
def search(txt, pat):
     '''
     A pattern searching function that uses Bad Character
     Heuristic of Boyer Moore Algorithm
     '''
     m = len (pat)
     n = len (txt)
  
     # create the bad character list by calling 
     # the preprocessing function badCharHeuristic()
     # for given pattern
     badChar = badCharHeuristic(pat, m) 
  
     # s is shift of the pattern with respect to text
     s = 0
     while s <= n - m:
         j = m - 1
  
         # Keep reducing index j of pattern while 
         # characters of pattern and text are matching
         # at this shift s
         while j >= 0 and pat[j] == txt[s + j]:
             j -= 1

         # If the pattern is present at current shift,
         # then index j will become -1 after the above loop
         if j < 0:
             print ( "Pattern occur at shift = {}" . format (s))
  
             '''    
                 Shift the pattern so that the next character in text
                       aligns with the last occurrence of it in pattern.
                 The condition s+m < n is necessary for the case when
                    pattern occurs at the end of text
                '''
             s += (m - badChar[ord(txt[s + m])] if s + m < n else 1)
         else :
             '''
                Shift the pattern so that the bad character in text
                aligns with the last occurrence of it in pattern. The
                max function is used to make sure that we get a positive
                shift. We may get a negative shift if the last occurrence
                of bad character in pattern is on the right side of the
                current character.
             '''
             s += max(1, j - badChar[ord(txt[s + j])])
  
  
# Driver program to test above function
def main():
     txt = "ABAAABCD"
     pat = "ABC"
     search(txt, pat)
  
if __name__ == '__main__':
     main()
  
# This code is contributed by Atul Kumar
# (www.facebook.com/atul.kr.007)

C#

/* C# Program for Bad Character Heuristic of Boyer 
Moore String Matching Algorithm */
  
using System;
public class AWQ{ 
      
     static int NO_OF_CHARS = 256; 
      
     //A utility function to get maximum of two integers 
     static int max ( int a, int b) { return (a > b)? a: b; } 
  
     //The preprocessing function for Boyer Moore's 
     //bad character heuristic 
     static void badCharHeuristic( char []str, int size, int []badchar) 
     { 
     int i; 
  
     // Initialize all occurrences as -1 
     for (i = 0; i < NO_OF_CHARS; i++) 
         badchar[i] = -1; 
  
     // Fill the actual value of last occurrence 
     // of a character 
     for (i = 0; i < size; i++) 
         badchar[( int ) str[i]] = i; 
     } 
  
     /* A pattern searching function that uses Bad 
     Character Heuristic of Boyer Moore Algorithm */
     static void search( char []txt, char []pat) 
     { 
     int m = pat.Length; 
     int n = txt.Length; 
  
     int []badchar = new int [NO_OF_CHARS]; 
  
     /* Fill the bad character array by calling 
         the preprocessing function badCharHeuristic() 
         for given pattern */
     badCharHeuristic(pat, m, badchar); 
  
     int s = 0; // s is shift of the pattern with 
                 // respect to text 
     while (s <= (n - m)) 
     { 
         int j = m-1; 
  
         /* Keep reducing index j of pattern while 
             characters of pattern and text are 
             matching at this shift s */
         while (j >= 0 && pat[j] == txt[s+j]) 
             j--; 
  
         /* If the pattern is present at current 
             shift, then index j will become -1 after 
             the above loop */
         if (j < 0) 
         { 
             Console.WriteLine( "Patterns occur at shift = " + s); 
  
             /* Shift the pattern so that the next 
                 character in text aligns with the last 
                 occurrence of it in pattern. 
                 The condition s+m < n is necessary for 
                 the case when pattern occurs at the end 
                 of text */
             s += (s+m < n)? m-badchar[txt[s+m]] : 1; 
  
         } 
  
         else
             /* Shift the pattern so that the bad character 
                 in text aligns with the last occurrence of 
                 it in pattern. The max function is used to 
                 make sure that we get a positive shift. 
                 We may get a negative shift if the last 
                 occurrence of bad character in pattern 
                 is on the right side of the current 
                 character. */
             s += max(1, j - badchar[txt[s+j]]); 
     } 
     } 
  
     /* Driver program to test above function */
     public static void Main() { 
          
         char []txt = "ABAAABCD" .ToCharArray(); 
         char []pat = "ABC" .ToCharArray(); 
         search(txt, pat); 
     } 
} 
  
// This code is contributed by PrinciRaj19992

Output:

pattern occurs at shift = 4

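The bad character heuristic may take O(mn) time in the worst case. The worst case occurs when all characters of the text and the pattern are the same, for example txt[] = "AAAAAAAAAAAAAAAAAA" and pat[] = "AAAAA".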
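The gap between the best and worst case can be checked by counting character comparisons. The sketch below is an instrumented version of the bad character search; the name count_comparisons and the test strings are assumptions made for this illustration, not part of the implementations above.

```python
def count_comparisons(txt, pat):
    """Run the bad character search; return (match shifts, comparisons made)."""
    n, m = len(txt), len(pat)
    last = {c: i for i, c in enumerate(pat)}
    s = 0
    comparisons = 0
    matches = []
    while s <= n - m:
        j = m - 1
        while j >= 0:
            comparisons += 1
            if pat[j] != txt[s + j]:
                break
            j -= 1
        if j < 0:
            matches.append(s)
            # Align the character after the window with its last occurrence.
            s += (m - last.get(txt[s + m], -1)) if s + m < n else 1
        else:
            s += max(1, j - last.get(txt[s + j], -1))
    return matches, comparisons


# Worst case: every alignment matches fully and the pattern moves by 1.
print(count_comparisons("A" * 18, "A" * 5))
# Best case: the first comparison fails on a character absent from pat.
print(count_comparisons("B" * 20, "A" * 5))
```

In the first call each of the n - m + 1 alignments costs m comparisons, while in the second call each alignment costs one comparison and the pattern jumps by m.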

Boyer Moore Algorithm | Good Suffix Heuristic

This article was co-authored by Atul Kumar.

木子山
