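Overview
Pattern searching is an important problem in computer science. When we search for a string in a notepad or Word file, in a browser, or in a database, a pattern searching algorithm is used to produce the search results. A typical problem statement is:
Given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char pat[], char txt[]) that prints all occurrences of pat[] in txt[]. You may assume that n > m.
Examples: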
Input: txt[] = "THIS IS A TEST TEXT"
pat[] = "TEST"
Output: Pattern found at index 10
Input: txt[] = "AABAACAADAABAABA"
pat[] = "AABA"
Output: Pattern found at index 0
Pattern found at index 9
Pattern found at index 12
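As a reference point for the expected outputs above, here is a minimal sketch that lists every occurrence using Python's built-in str.find. The helper name find_all is an assumption made for illustration; it is not the Boyer Moore algorithm, only a way to check results.

```python
def find_all(txt, pat):
    """Return every index at which pat occurs in txt (overlaps included)."""
    hits = []
    i = txt.find(pat)
    while i != -1:
        hits.append(i)
        # Restart one position later so overlapping matches are not skipped.
        i = txt.find(pat, i + 1)
    return hits


print(find_all("THIS IS A TEST TEXT", "TEST"))  # [10]
print(find_all("AABAACAADAABAABA", "AABA"))     # [0, 9, 12]
```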
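In this post, we will discuss the Boyer Moore pattern searching algorithm. Like the KMP and Finite Automata algorithms, the Boyer Moore algorithm also preprocesses the pattern.
The Boyer Moore algorithm is a combination of the following two approaches.
1) Bad Character Heuristic
2) Good Suffix Heuristic
Each of the two heuristics can also be used on its own to search for a pattern in a text. Let us first understand how the two approaches work together in the Boyer Moore algorithm. The naive algorithm slides the pattern over the text one position at a time. The KMP algorithm preprocesses the pattern so that the pattern can be shifted by more than one position. The Boyer Moore algorithm preprocesses the pattern for the same reason: it builds a separate array for each of the two heuristics. At every step, it slides the pattern by the larger of the two shifts suggested by the heuristics, so it uses the better of the two heuristics at every step.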
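For comparison, the naive approach mentioned above can be sketched as follows. The function name naive_search is an assumption made for illustration; the point is only that the pattern always advances by exactly one position, which is what the preprocessing in KMP and Boyer Moore tries to improve on.

```python
def naive_search(txt, pat):
    """Slide pat over txt one position at a time; return all match indices."""
    n, m = len(txt), len(pat)
    hits = []
    for s in range(n - m + 1):          # every alignment, no skipping
        j = 0
        while j < m and txt[s + j] == pat[j]:
            j += 1
        if j == m:                      # all m characters matched
            hits.append(s)
    return hits


print(naive_search("ABAAABCD", "ABC"))  # [4]
```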
Unlike the previous pattern searching algorithms, the Boyer Moore algorithm starts matching from the last character of the pattern.
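The right-to-left comparison can be isolated in a small helper. This is a sketch under the assumption that s is a valid alignment (0 <= s <= n - m); the name mismatch_index is hypothetical.

```python
def mismatch_index(txt, pat, s):
    """Compare pat with txt[s:s+len(pat)] from right to left.

    Returns the pattern index of the first mismatch found,
    or -1 if the whole pattern matches at alignment s.
    """
    j = len(pat) - 1
    while j >= 0 and pat[j] == txt[s + j]:
        j -= 1
    return j


print(mismatch_index("ABAAABCD", "ABC", 0))  # 2  ('C' vs 'A')
print(mismatch_index("ABAAABCD", "ABC", 4))  # -1 (full match)
```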
In this post, we will discuss the bad character heuristic; the good suffix heuristic is discussed in the next post.
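Bad Character Heuristic
The idea of the bad character heuristic is simple. The character of the text that does not match the current character of the pattern is called the bad character. Upon a mismatch, we shift the pattern until:
1) The mismatch becomes a match.
2) The pattern P moves past the mismatched character.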
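Case 1: Mismatch becomes a match
We look up the position of the last occurrence of the mismatched character in the pattern. If the mismatched character occurs in the pattern, we shift the pattern so that this occurrence is aligned with the mismatched character in the text T.
Figure: Case 1
Explanation: In the example above, we get a mismatch at position 3. The mismatched character is "A". We now search for the last occurrence of "A" in the pattern and find it at position 1 of the pattern (shown in blue). We therefore shift the pattern by 2 positions (3 - 1 = 2), so that the "A" in the pattern is aligned with the "A" in the text.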
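The arithmetic of Case 1 can be traced with a short sketch. The text and pattern below are hypothetical strings, chosen only because they agree with the positions quoted in the explanation; the helper name case1_shift is likewise an assumption. The sketch assumes a mismatch occurs and that the bad character is present in the pattern.

```python
def case1_shift(txt, pat, s):
    """Return (j, bad_char, last_index, shift) for the mismatch at alignment s."""
    # Last index at which each pattern character occurs.
    last = {ch: i for i, ch in enumerate(pat)}
    j = len(pat) - 1
    while j >= 0 and pat[j] == txt[s + j]:   # compare right to left
        j -= 1
    bad = txt[s + j]                          # the bad character in the text
    return j, bad, last[bad], j - last[bad]


# Hypothetical strings, consistent with "mismatch at position 3".
txt = "GCAATGCCTATGTGACC"
pat = "TATGTG"
print(case1_shift(txt, pat, 0))  # (3, 'A', 1, 2)
```

Note that j - last[bad] can be zero or negative in general; the implementations in this article guard against that with max(1, ...).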
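Case 2: Pattern moves past the mismatched character
We look up the position of the last occurrence of the mismatched character in the pattern. If the character does not occur in the pattern, we shift the pattern past the mismatched character.
Figure: Case 2
Explanation: Here we have a mismatch at position 7. The mismatched character "C" does not occur in the pattern before position 7, so we shift the pattern past position 7. In the example above, this finally gives a perfect match of the pattern (shown in green). We can do this because "C" does not occur in the pattern, so every smaller shift would still place some pattern character against the "C" at position 7 and would produce a mismatch.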
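Case 2 can be traced the same way. The strings are again hypothetical, chosen to agree with the position quoted in the explanation, and case2_shift is an assumed helper name. A character that is absent from the pattern is treated as having last occurrence -1, which is the convention used by the implementations in this article.

```python
def case2_shift(txt, pat, s):
    """Return (j, bad_char, shift) for the mismatch at alignment s."""
    last = {ch: i for i, ch in enumerate(pat)}
    j = len(pat) - 1
    while j >= 0 and pat[j] == txt[s + j]:   # compare right to left
        j -= 1
    bad = txt[s + j]
    # -1 stands for "not in the pattern", so the shift becomes j + 1.
    return j, bad, j - last.get(bad, -1)


# Hypothetical strings, consistent with "mismatch at position 7".
txt = "GCAATGCCTATGTGACC"
pat = "TATGTG"
s = 2
j, bad, shift = case2_shift(txt, pat, s)
print(j, bad, shift)                          # 5 C 6
print(txt[s + shift:s + shift + len(pat)])    # TATGTG
```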
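In the following implementation, we preprocess the pattern and store the last occurrence of every possible character in an array whose size equals the alphabet size. If a character does not occur in the pattern at all, a mismatch on it may result in a shift by m (the length of the pattern). Therefore, the bad character heuristic takes O(n/m) time in the best case.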
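A minimal sketch of the preprocessing step, assuming a 256-character alphabet as in the implementations below; the helper name last_occurrence is hypothetical. It also shows why a mismatch on an absent character at the last pattern position yields a shift of m.

```python
NO_OF_CHARS = 256  # assumed alphabet size (single-byte character codes)


def last_occurrence(pat):
    """Return a table mapping each character code to its last index in pat."""
    table = [-1] * NO_OF_CHARS        # -1 means "does not occur in pat"
    for i, ch in enumerate(pat):
        table[ord(ch)] = i            # later positions overwrite earlier ones
    return table


table = last_occurrence("AABA")
print(table[ord("A")], table[ord("B")], table[ord("C")])  # 3 2 -1

# Mismatch at the last pattern position (j = m - 1) on the absent 'C':
m = 4
j = m - 1
print(j - table[ord("C")])  # 4, i.e. a shift by the full pattern length m
```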
C++
/* C++ Program for Bad Character Heuristic of Boyer
   Moore String Matching Algorithm */
#include <algorithm>
#include <iostream>
#include <string>
using namespace std;
#define NO_OF_CHARS 256

// The preprocessing function for Boyer Moore's
// bad character heuristic
void badCharHeuristic(const string& str, int size, int badchar[NO_OF_CHARS])
{
    int i;

    // Initialize all occurrences as -1
    for (i = 0; i < NO_OF_CHARS; i++)
        badchar[i] = -1;

    // Fill the actual value of last occurrence of a character.
    // The cast to unsigned char keeps the index in 0..255 even
    // on platforms where plain char is signed.
    for (i = 0; i < size; i++)
        badchar[(unsigned char)str[i]] = i;
}

/* A pattern searching function that uses Bad
   Character Heuristic of Boyer Moore Algorithm */
void search(const string& txt, const string& pat)
{
    int m = pat.size();
    int n = txt.size();

    int badchar[NO_OF_CHARS];

    /* Fill the bad character array by calling
       the preprocessing function badCharHeuristic()
       for given pattern */
    badCharHeuristic(pat, m, badchar);

    int s = 0; // s is shift of the pattern with
               // respect to text
    while (s <= (n - m))
    {
        int j = m - 1;

        /* Keep reducing index j of pattern while
           characters of pattern and text are
           matching at this shift s */
        while (j >= 0 && pat[j] == txt[s + j])
            j--;

        /* If the pattern is present at current
           shift, then index j will become -1 after
           the above loop */
        if (j < 0)
        {
            cout << "pattern occurs at shift = " << s << endl;

            /* Shift the pattern so that the next
               character in text aligns with the last
               occurrence of it in pattern.
               The condition s+m < n is necessary for
               the case when pattern occurs at the end
               of text */
            s += (s + m < n) ? m - badchar[(unsigned char)txt[s + m]] : 1;
        }
        else
            /* Shift the pattern so that the bad character
               in text aligns with the last occurrence of
               it in pattern. The max function is used to
               make sure that we get a positive shift.
               We may get a negative shift if the last
               occurrence of bad character in pattern
               is on the right side of the current
               character. */
            s += max(1, j - badchar[(unsigned char)txt[s + j]]);
    }
}

/* Driver code */
int main()
{
    string txt = "ABAAABCD";
    string pat = "ABC";
    search(txt, pat);
    return 0;
}
// This code is contributed by rathbhupendra
C
/* C Program for Bad Character Heuristic of Boyer
   Moore String Matching Algorithm */
#include <limits.h>
#include <string.h>
#include <stdio.h>
#define NO_OF_CHARS 256

// A utility function to get maximum of two integers
int max(int a, int b) { return (a > b) ? a : b; }

// The preprocessing function for Boyer Moore's
// bad character heuristic
void badCharHeuristic(char *str, int size, int badchar[NO_OF_CHARS])
{
    int i;

    // Initialize all occurrences as -1
    for (i = 0; i < NO_OF_CHARS; i++)
        badchar[i] = -1;

    // Fill the actual value of last occurrence of a character.
    // The cast to unsigned char keeps the index in 0..255 even
    // on platforms where plain char is signed.
    for (i = 0; i < size; i++)
        badchar[(unsigned char)str[i]] = i;
}

/* A pattern searching function that uses Bad
   Character Heuristic of Boyer Moore Algorithm */
void search(char *txt, char *pat)
{
    int m = strlen(pat);
    int n = strlen(txt);

    int badchar[NO_OF_CHARS];

    /* Fill the bad character array by calling
       the preprocessing function badCharHeuristic()
       for given pattern */
    badCharHeuristic(pat, m, badchar);

    int s = 0; // s is shift of the pattern with
               // respect to text
    while (s <= (n - m))
    {
        int j = m - 1;

        /* Keep reducing index j of pattern while
           characters of pattern and text are
           matching at this shift s */
        while (j >= 0 && pat[j] == txt[s + j])
            j--;

        /* If the pattern is present at current
           shift, then index j will become -1 after
           the above loop */
        if (j < 0)
        {
            printf("\n pattern occurs at shift = %d", s);

            /* Shift the pattern so that the next
               character in text aligns with the last
               occurrence of it in pattern.
               The condition s+m < n is necessary for
               the case when pattern occurs at the end
               of text */
            s += (s + m < n) ? m - badchar[(unsigned char)txt[s + m]] : 1;
        }
        else
            /* Shift the pattern so that the bad character
               in text aligns with the last occurrence of
               it in pattern. The max function is used to
               make sure that we get a positive shift.
               We may get a negative shift if the last
               occurrence of bad character in pattern
               is on the right side of the current
               character. */
            s += max(1, j - badchar[(unsigned char)txt[s + j]]);
    }
}

/* Driver program to test above function */
int main()
{
    char txt[] = "ABAAABCD";
    char pat[] = "ABC";
    search(txt, pat);
    return 0;
}
Java
/* Java Program for Bad Character Heuristic of Boyer
   Moore String Matching Algorithm */
class AWQ {

    static int NO_OF_CHARS = 256;

    // A utility function to get maximum of two integers
    static int max(int a, int b) { return (a > b) ? a : b; }

    // The preprocessing function for Boyer Moore's
    // bad character heuristic.
    // Assumes every character code is below NO_OF_CHARS.
    static void badCharHeuristic(char[] str, int size, int[] badchar)
    {
        int i;

        // Initialize all occurrences as -1
        for (i = 0; i < NO_OF_CHARS; i++)
            badchar[i] = -1;

        // Fill the actual value of last occurrence
        // of a character
        for (i = 0; i < size; i++)
            badchar[(int) str[i]] = i;
    }

    /* A pattern searching function that uses Bad
       Character Heuristic of Boyer Moore Algorithm */
    static void search(char[] txt, char[] pat)
    {
        int m = pat.length;
        int n = txt.length;

        int[] badchar = new int[NO_OF_CHARS];

        /* Fill the bad character array by calling
           the preprocessing function badCharHeuristic()
           for given pattern */
        badCharHeuristic(pat, m, badchar);

        int s = 0; // s is shift of the pattern with
                   // respect to text
        while (s <= (n - m))
        {
            int j = m - 1;

            /* Keep reducing index j of pattern while
               characters of pattern and text are
               matching at this shift s */
            while (j >= 0 && pat[j] == txt[s + j])
                j--;

            /* If the pattern is present at current
               shift, then index j will become -1 after
               the above loop */
            if (j < 0)
            {
                System.out.println("Pattern occurs at shift = " + s);

                /* Shift the pattern so that the next
                   character in text aligns with the last
                   occurrence of it in pattern.
                   The condition s+m < n is necessary for
                   the case when pattern occurs at the end
                   of text */
                s += (s + m < n) ? m - badchar[txt[s + m]] : 1;
            }
            else
                /* Shift the pattern so that the bad character
                   in text aligns with the last occurrence of
                   it in pattern. The max function is used to
                   make sure that we get a positive shift.
                   We may get a negative shift if the last
                   occurrence of bad character in pattern
                   is on the right side of the current
                   character. */
                s += max(1, j - badchar[txt[s + j]]);
        }
    }

    /* Driver program to test above function */
    public static void main(String[] args)
    {
        char[] txt = "ABAAABCD".toCharArray();
        char[] pat = "ABC".toCharArray();
        search(txt, pat);
    }
}
Python
# Python3 Program for Bad Character Heuristic
# of Boyer Moore String Matching Algorithm

NO_OF_CHARS = 256


def badCharHeuristic(string, size):
    '''
    The preprocessing function for
    Boyer Moore's bad character heuristic
    '''
    # Initialize all occurrences as -1
    badChar = [-1] * NO_OF_CHARS

    # Fill the actual value of last occurrence
    for i in range(size):
        badChar[ord(string[i])] = i

    # Return the initialized list
    return badChar


def search(txt, pat):
    '''
    A pattern searching function that uses Bad Character
    Heuristic of Boyer Moore Algorithm
    '''
    m = len(pat)
    n = len(txt)

    # Create the bad character list by calling
    # the preprocessing function badCharHeuristic()
    # for given pattern
    badChar = badCharHeuristic(pat, m)

    # s is shift of the pattern with respect to text
    s = 0
    while s <= n - m:
        j = m - 1

        # Keep reducing index j of pattern while
        # characters of pattern and text are matching
        # at this shift s
        while j >= 0 and pat[j] == txt[s + j]:
            j -= 1

        # If the pattern is present at current shift,
        # then index j will become -1 after the above loop
        if j < 0:
            print("Pattern occurs at shift = {}".format(s))

            # Shift the pattern so that the next character in text
            # aligns with the last occurrence of it in pattern.
            # The condition s+m < n is necessary for the case when
            # pattern occurs at the end of text
            s += (m - badChar[ord(txt[s + m])] if s + m < n else 1)
        else:
            # Shift the pattern so that the bad character in text
            # aligns with the last occurrence of it in pattern. The
            # max function is used to make sure that we get a positive
            # shift. We may get a negative shift if the last occurrence
            # of bad character in pattern is on the right side of the
            # current character.
            s += max(1, j - badChar[ord(txt[s + j])])


# Driver program to test above function
def main():
    txt = "ABAAABCD"
    pat = "ABC"
    search(txt, pat)


if __name__ == '__main__':
    main()

# This code is contributed by Atul Kumar
# (www.facebook.com/atul.kr.007)
C#
/* C# Program for Bad Character Heuristic of Boyer
   Moore String Matching Algorithm */
using System;

public class AWQ {

    static int NO_OF_CHARS = 256;

    // A utility function to get maximum of two integers
    static int max(int a, int b) { return (a > b) ? a : b; }

    // The preprocessing function for Boyer Moore's
    // bad character heuristic.
    // Assumes every character code is below NO_OF_CHARS.
    static void badCharHeuristic(char[] str, int size, int[] badchar)
    {
        int i;

        // Initialize all occurrences as -1
        for (i = 0; i < NO_OF_CHARS; i++)
            badchar[i] = -1;

        // Fill the actual value of last occurrence
        // of a character
        for (i = 0; i < size; i++)
            badchar[(int) str[i]] = i;
    }

    /* A pattern searching function that uses Bad
       Character Heuristic of Boyer Moore Algorithm */
    static void search(char[] txt, char[] pat)
    {
        int m = pat.Length;
        int n = txt.Length;

        int[] badchar = new int[NO_OF_CHARS];

        /* Fill the bad character array by calling
           the preprocessing function badCharHeuristic()
           for given pattern */
        badCharHeuristic(pat, m, badchar);

        int s = 0; // s is shift of the pattern with
                   // respect to text
        while (s <= (n - m))
        {
            int j = m - 1;

            /* Keep reducing index j of pattern while
               characters of pattern and text are
               matching at this shift s */
            while (j >= 0 && pat[j] == txt[s + j])
                j--;

            /* If the pattern is present at current
               shift, then index j will become -1 after
               the above loop */
            if (j < 0)
            {
                Console.WriteLine("Pattern occurs at shift = " + s);

                /* Shift the pattern so that the next
                   character in text aligns with the last
                   occurrence of it in pattern.
                   The condition s+m < n is necessary for
                   the case when pattern occurs at the end
                   of text */
                s += (s + m < n) ? m - badchar[txt[s + m]] : 1;
            }
            else
                /* Shift the pattern so that the bad character
                   in text aligns with the last occurrence of
                   it in pattern. The max function is used to
                   make sure that we get a positive shift.
                   We may get a negative shift if the last
                   occurrence of bad character in pattern
                   is on the right side of the current
                   character. */
                s += max(1, j - badchar[txt[s + j]]);
        }
    }

    /* Driver program to test above function */
    public static void Main()
    {
        char[] txt = "ABAAABCD".ToCharArray();
        char[] pat = "ABC".ToCharArray();
        search(txt, pat);
    }
}
// This code is contributed by PrinciRaj19992
Output:
pattern occurs at shift = 4
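The bad character heuristic may take O(mn) time in the worst case. The worst case occurs when all characters of the text and the pattern are the same. For example, txt[] = "AAAAAAAAAAAAAAAAAA" and pat[] = "AAAAA".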
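The gap between the best and worst case can be made concrete by counting character comparisons. The sketch below reuses the shift logic of the implementations above; the function name count_comparisons and the test strings are assumptions made for illustration (an all-'A' text of 18 characters for the worst case, and a text that shares no character with the pattern for the best case).

```python
def count_comparisons(txt, pat):
    """Run the bad character search and return the number of comparisons made."""
    n, m = len(txt), len(pat)
    table = [-1] * 256                 # last occurrence of each character code
    for i, ch in enumerate(pat):
        table[ord(ch)] = i

    s = 0
    comparisons = 0
    while s <= n - m:
        j = m - 1
        while j >= 0:                  # compare right to left
            comparisons += 1
            if pat[j] != txt[s + j]:
                break
            j -= 1
        if j < 0:                      # full match at alignment s
            s += (m - table[ord(txt[s + m])]) if s + m < n else 1
        else:                          # mismatch on the bad character txt[s + j]
            s += max(1, j - table[ord(txt[s + j])])
    return comparisons


# Worst case: every alignment matches fully, and the shift is always 1.
print(count_comparisons("A" * 18, "AAAAA"))  # 70 = (18 - 5 + 1) * 5

# Best case: one comparison per alignment, and the shift is always m.
print(count_comparisons("A" * 20, "BCDE"))   # 5 = 20 / 4
```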
Boyer Moore Algorithm | Good Suffix Heuristic
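This article is co-authored by Atul Kumar. Please write a comment if you find anything incorrect, or if you want to share more information about the topic discussed above.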