高级算法：模式搜索的KMP算法详细实现

2021年4月15日17:23:04 发表评论 1,407 次浏览

本文概述

C ++
Java
python
C#
的PHP

给定文字txt [0..n-1]和一个模式拍[0..m-1], 写一个函数搜索(char pat [], char txt [])打印所有出现的拍[]in文本文件[]。你可能会认为n>米.

例子：

Input:  txt[] = "THIS IS A TEST TEXT"
        pat[] = "TEST"
Output: Pattern found at index 10

Input:  txt[] =  "AABAACAADAABAABA"
        pat[] =  "AABA"
Output: Pattern found at index 0
        Pattern found at index 9
        Pattern found at index 12

模式搜索是计算机科学中的一个重要问题。当我们在记事本/单词文件或浏览器或数据库中搜索字符串时, 将使用模式搜索算法来显示搜索结果。

我们已经讨论了朴素模式搜索算法以前的帖子。朴素算法的最坏情况复杂度是O(m(n-m + 1))。在最坏的情况下, KMP算法的时间复杂度为O(n)。

KMP(Knuth Morris Pratt)模式搜索

朴素模式搜索算法

如果我们看到许多匹配的字符后跟不匹配的字符, 则效果不佳。以下是一些示例。

txt[] = "AAAAAAAAAAAAAAAAAB"
   pat[] = "AAAAB"

   txt[] = "ABABABCABABABCABABABC"
   pat[] =  "ABABAC" (not a worst case, but a bad case for Naive)

KMP匹配算法使用模式的退化属性(具有相同子模式的模式在模式中出现多次), 并将最坏情况下的复杂度提高到O(n)。 KMP算法的基本思想是：每当检测到不匹配(在某些匹配之后)时, 我们就已经知道下一个窗口文本中的某些字符。我们利用这些信息来避免匹配我们知道会匹配的字符。让我们考虑下面的例子来理解这一点。

Matching Overview
txt = "AAAAABAAABA" 
pat = "AAAA"

We compare first window of txt with pat
txt = "AAAAABAAABA" 
pat = "AAAA"  [Initial position]
We find a match. This is same as Naive String Matching.

In the next step, we compare next window of txt with pat.
txt = "AAAAABAAABA" 
pat =  "AAAA" [Pattern shifted one position]
This is where KMP does optimization over Naive. In this 
second window, we only compare fourth A of pattern
with fourth character of current window of text to decide 
whether current window matches or not. Since we know 
first three characters will anyway match, we skipped 
matching first three characters. 

Need of Preprocessing?
An important question arises from the above explanation, how to know how many characters to be skipped. To know this, we pre-process pattern and prepare an integer array 
lps[] that tells us the count of characters to be skipped.

预处理概述：

KMP算法对pat []进行预处理, 并构造一个大小为m(与模式大小相同)的辅助lps [], 用于在匹配时跳过字符。

名称lps表示最长的正确前缀, 该前缀也带有后缀。正确的前缀是不允许带有完整字符串的前缀。例如, " ABC"的前缀是"", " A", " AB"和" ABC"。适当的前缀为"", " A"和" AB"。字符串的后缀是"", " C", " BC"和" ABC"。

我们在子模式中搜索lps。更清楚地, 我们集中于模式的子字符串, 即前缀和后缀。

对于其中i = 0到m-1的每个子模式pat [0..i], lps [i]存储最大匹配的适当前缀的长度, 该前缀也是子模式pat [0..i]的后缀。 lps [i] = pat [0..i]的最长适当前缀, 也是pat [0..i]的后缀。

注意：lps [i]也可以定义为最长前缀, 这也是适当的后缀。我们需要在一个地方正确使用以确保不考虑整个子字符串。

Examples of lps[] construction:
For the pattern "AAAA", lps[] is [0, 1, 2, 3]

For the pattern "ABCDE", lps[] is [0, 0, 0, 0, 0]

For the pattern "AABAACAABAA", lps[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]

For the pattern "AAACAAAAAC", lps[] is [0, 1, 2, 0, 1, 2, 3, 3, 3, 4] 

For the pattern "AAABAAA", lps[] is [0, 1, 2, 0, 1, 2, 3]

搜索算法：

不像

天真的算法

, 在此模式下, 我们将模式滑动一个并在每次移位时比较所有字符, 我们使用lps []中的值来确定要匹配的下一个字符。这个想法是不匹配我们知道会匹配的字符。

如何使用lps []决定下一个位置(或知道要跳过的字符数)？

我们开始将j = 0的pat [j]与当前文本窗口的字符进行比较。
我们保持匹配字符txt [i]和pat [j], 并保持i和j递增, 而pat [j]和txt [i]保持匹配.
当我们看到一个不匹配
- 我们知道字符pat [0..j-1]与txt [i-j…i-1]相匹配(请注意, j以0开头, 仅在存在匹配项时递增)。
- 从上面的定义中我们还知道lps [j-1]是pat [0…j-1]的字符计数, 它们都是正确的前缀和后缀。
- 从以上两点可以得出结论, 我们不需要将这些lps [j-1]字符与txt [i-j…i-1]匹配, 因为我们知道这些字符仍然可以匹配。让我们考虑上面的例子来理解这一点。

txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
lps[] = {0, 1, 2, 3} 

i = 0, j = 0
txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 1, j = 1
txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 2, j = 2
txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
pat[i] and pat[j] match, do i++, j++

i = 3, j = 3
txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 4, j = 4
Since j == M, print pattern found and reset j, j = lps[j-1] = lps[3] = 3

Here unlike Naive algorithm, we do not match first three 
characters of this window. Value of lps[j-1] (in above 
step) gave us index of next character to match.
i = 4, j = 3
txt[] = "AAAAABAAABA" 
pat[] =  "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 5, j = 4
Since j == M, print pattern found and reset j, j = lps[j-1] = lps[3] = 3

Again unlike Naive algorithm, we do not match first three 
characters of this window. Value of lps[j-1] (in above 
step) gave us index of next character to match.
i = 5, j = 3
txt[] = "AAAAABAAABA" 
pat[] =   "AAAA"
txt[i] and pat[j] do NOT match and j> 0, change only j
j = lps[j-1] = lps[2] = 2

i = 5, j = 2
txt[] = "AAAAABAAABA" 
pat[] =    "AAAA"
txt[i] and pat[j] do NOT match and j> 0, change only j
j = lps[j-1] = lps[1] = 1 

i = 5, j = 1
txt[] = "AAAAABAAABA" 
pat[] =     "AAAA"
txt[i] and pat[j] do NOT match and j> 0, change only j
j = lps[j-1] = lps[0] = 0

i = 5, j = 0
txt[] = "AAAAABAAABA" 
pat[] =      "AAAA"
txt[i] and pat[j] do NOT match and j is 0, we do i++.

i = 6, j = 0
txt[] = "AAAAABAAABA" 
pat[] =       "AAAA"
txt[i] and pat[j] match, do i++ and j++

i = 7, j = 1
txt[] = "AAAAABAAABA" 
pat[] =       "AAAA"
txt[i] and pat[j] match, do i++ and j++

We continue this way...

C ++

//C++ program for implementation of KMP pattern searching
//algorithm
#include <bits/stdc++.h>
  
void computeLPSArray( char * pat, int M, int * lps);
  
//Prints occurrences of txt[] in pat[]
void KMPSearch( char * pat, char * txt)
{
     int M = strlen (pat);
     int N = strlen (txt);
  
     //create lps[] that will hold the longest prefix suffix
     //values for pattern
     int lps[M];
  
     //Preprocess the pattern (calculate lps[] array)
     computeLPSArray(pat, M, lps);
  
     int i = 0; //index for txt[]
     int j = 0; //index for pat[]
     while (i <N) {
         if (pat[j] == txt[i]) {
             j++;
             i++;
         }
  
         if (j == M) {
             printf ( "Found pattern at index %d " , i - j);
             j = lps[j - 1];
         }
  
         //mismatch after j matches
         else if (i <N && pat[j] != txt[i]) {
             //Do not match lps[0..lps[j-1]] characters, //they will match anyway
             if (j != 0)
                 j = lps[j - 1];
             else
                 i = i + 1;
         }
     }
}
  
//Fills lps[] for given patttern pat[0..M-1]
void computeLPSArray( char * pat, int M, int * lps)
{
     //length of the previous longest prefix suffix
     int len = 0;
  
     lps[0] = 0; //lps[0] is always 0
  
     //the loop calculates lps[i] for i = 1 to M-1
     int i = 1;
     while (i <M) {
         if (pat[i] == pat[len]) {
             len++;
             lps[i] = len;
             i++;
         }
         else //(pat[i] != pat[len])
         {
             //This is tricky. Consider the example.
             //AAACAAAA and i = 7. The idea is similar
             //to search step.
             if (len != 0) {
                 len = lps[len - 1];
  
                 //Also, note that we do not increment
                 //i here
             }
             else //if (len == 0)
             {
                 lps[i] = 0;
                 i++;
             }
         }
     }
}
  
//Driver program to test above function
int main()
{
     char txt[] = "ABABDABACDABABCABAB" ;
     char pat[] = "ABABCABAB" ;
     KMPSearch(pat, txt);
     return 0;
}

Java

//JAVA program for implementation of KMP pattern
//searching algorithm
  
class KMP_String_Matching {
     void KMPSearch(String pat, String txt)
     {
         int M = pat.length();
         int N = txt.length();
  
         //create lps[] that will hold the longest
         //prefix suffix values for pattern
         int lps[] = new int [M];
         int j = 0 ; //index for pat[]
  
         //Preprocess the pattern (calculate lps[]
         //array)
         computeLPSArray(pat, M, lps);
  
         int i = 0 ; //index for txt[]
         while (i <N) {
             if (pat.charAt(j) == txt.charAt(i)) {
                 j++;
                 i++;
             }
             if (j == M) {
                 System.out.println( "Found pattern "
                                    + "at index " + (i - j));
                 j = lps[j - 1 ];
             }
  
             //mismatch after j matches
             else if (i <N && pat.charAt(j) != txt.charAt(i)) {
                 //Do not match lps[0..lps[j-1]] characters, //they will match anyway
                 if (j != 0 )
                     j = lps[j - 1 ];
                 else
                     i = i + 1 ;
             }
         }
     }
  
     void computeLPSArray(String pat, int M, int lps[])
     {
         //length of the previous longest prefix suffix
         int len = 0 ;
         int i = 1 ;
         lps[ 0 ] = 0 ; //lps[0] is always 0
  
         //the loop calculates lps[i] for i = 1 to M-1
         while (i <M) {
             if (pat.charAt(i) == pat.charAt(len)) {
                 len++;
                 lps[i] = len;
                 i++;
             }
             else //(pat[i] != pat[len])
             {
                 //This is tricky. Consider the example.
                 //AAACAAAA and i = 7. The idea is similar
                 //to search step.
                 if (len != 0 ) {
                     len = lps[len - 1 ];
  
                     //Also, note that we do not increment
                     //i here
                 }
                 else //if (len == 0)
                 {
                     lps[i] = len;
                     i++;
                 }
             }
         }
     }
  
     //Driver program to test above function
     public static void main(String args[])
     {
         String txt = "ABABDABACDABABCABAB" ;
         String pat = "ABABCABAB" ;
         new KMP_String_Matching().KMPSearch(pat, txt);
     }
}
//This code has been contributed by Amit Khandelwal.

python

# Python program for KMP Algorithm
def KMPSearch(pat, txt):
     M = len (pat)
     N = len (txt)
  
     # create lps[] that will hold the longest prefix suffix 
     # values for pattern
     lps = [ 0 ] * M
     j = 0 # index for pat[]
  
     # Preprocess the pattern (calculate lps[] array)
     computeLPSArray(pat, M, lps)
  
     i = 0 # index for txt[]
     while i <N:
         if pat[j] = = txt[i]:
             i + = 1
             j + = 1
  
         if j = = M:
             print ( "Found pattern at index " + str (i - j))
             j = lps[j - 1 ]
  
         # mismatch after j matches
         elif i <N and pat[j] ! = txt[i]:
             # Do not match lps[0..lps[j-1]] characters, # they will match anyway
             if j ! = 0 :
                 j = lps[j - 1 ]
             else :
                 i + = 1
  
def computeLPSArray(pat, M, lps):
     len = 0 # length of the previous longest prefix suffix
  
     lps[ 0 ] # lps[0] is always 0
     i = 1
  
     # the loop calculates lps[i] for i = 1 to M-1
     while i <M:
         if pat[i] = = pat[ len ]:
             len + = 1
             lps[i] = len
             i + = 1
         else :
             # This is tricky. Consider the example.
             # AAACAAAA and i = 7. The idea is similar 
             # to search step.
             if len ! = 0 :
                 len = lps[ len - 1 ]
  
                 # Also, note that we do not increment i here
             else :
                 lps[i] = 0
                 i + = 1
  
txt = "ABABDABACDABABCABAB"
pat = "ABABCABAB"
KMPSearch(pat, txt)
  
# This code is contributed by Bhavya Jain

C#

//C# program for implementation of KMP pattern
//searching algorithm
using System;
  
class GFG {
  
     void KMPSearch( string pat, string txt)
     {
         int M = pat.Length;
         int N = txt.Length;
  
         //create lps[] that will hold the longest
         //prefix suffix values for pattern
         int [] lps = new int [M];
         int j = 0; //index for pat[]
  
         //Preprocess the pattern (calculate lps[]
         //array)
         computeLPSArray(pat, M, lps);
  
         int i = 0; //index for txt[]
         while (i <N) {
             if (pat[j] == txt[i]) {
                 j++;
                 i++;
             }
             if (j == M) {
                 Console.Write( "Found pattern "
                               + "at index " + (i - j));
                 j = lps[j - 1];
             }
  
             //mismatch after j matches
             else if (i <N && pat[j] != txt[i]) {
                 //Do not match lps[0..lps[j-1]] characters, //they will match anyway
                 if (j != 0)
                     j = lps[j - 1];
                 else
                     i = i + 1;
             }
         }
     }
  
     void computeLPSArray( string pat, int M, int [] lps)
     {
         //length of the previous longest prefix suffix
         int len = 0;
         int i = 1;
         lps[0] = 0; //lps[0] is always 0
  
         //the loop calculates lps[i] for i = 1 to M-1
         while (i <M) {
             if (pat[i] == pat[len]) {
                 len++;
                 lps[i] = len;
                 i++;
             }
             else //(pat[i] != pat[len])
             {
                 //This is tricky. Consider the example.
                 //AAACAAAA and i = 7. The idea is similar
                 //to search step.
                 if (len != 0) {
                     len = lps[len - 1];
  
                     //Also, note that we do not increment
                     //i here
                 }
                 else //if (len == 0)
                 {
                     lps[i] = len;
                     i++;
                 }
             }
         }
     }
  
     //Driver program to test above function
     public static void Main()
     {
         string txt = "ABABDABACDABABCABAB" ;
         string pat = "ABABCABAB" ;
         new GFG().KMPSearch(pat, txt);
     }
}
  
//This code has been contributed by Amit Khandelwal.

的PHP

<?php
//PHP program for implementation of KMP pattern searching
//algorithm
  
  
//Prints occurrences of txt[] in pat[]
function KMPSearch( $pat , $txt )
{
     $M = strlen ( $pat );
     $N = strlen ( $txt );
  
     //create lps[] that will hold the longest prefix suffix
     //values for pattern
     $lps = array_fill (0, $M , 0);
  
     //Preprocess the pattern (calculate lps[] array)
     computeLPSArray( $pat , $M , $lps );
  
     $i = 0; //index for txt[]
     $j = 0; //index for pat[]
     while ( $i <$N ) {
         if ( $pat [ $j ] == $txt [ $i ]) {
             $j ++;
             $i ++;
         }
  
         if ( $j == $M ) {
             printf( "Found pattern at index " .( $i - $j ));
             $j = $lps [ $j - 1];
         }
  
         //mismatch after j matches
         else if ( $i <$N && $pat [ $j ] != $txt [ $i ]) {
             //Do not match lps[0..lps[j-1]] characters, //they will match anyway
             if ( $j != 0)
                 $j = $lps [ $j - 1];
             else
                 $i = $i + 1;
         }
     }
}
  
//Fills lps[] for given patttern pat[0..M-1]
function computeLPSArray( $pat , $M , & $lps )
{
     //length of the previous longest prefix suffix
     $len = 0;
  
     $lps [0] = 0; //lps[0] is always 0
  
     //the loop calculates lps[i] for i = 1 to M-1
     $i = 1;
     while ( $i <$M ) {
         if ( $pat [ $i ] == $pat [ $len ]) {
             $len ++;
             $lps [ $i ] = $len ;
             $i ++;
         }
         else //(pat[i] != pat[len])
         {
             //This is tricky. Consider the example.
             //AAACAAAA and i = 7. The idea is similar
             //to search step.
             if ( $len != 0) {
                 $len = $lps [ $len - 1];
  
                 //Also, note that we do not increment
                 //i here
             }
             else //if (len == 0)
             {
                 $lps [ $i ] = 0;
                 $i ++;
             }
         }
     }
}
  
//Driver program to test above function
  
     $txt = "ABABDABACDABABCABAB" ;
     $pat = "ABABCABAB" ;
     KMPSearch( $pat , $txt );
      
//This code is contributed by chandan_jnu
?>

输出如下：

Found pattern at index 10

预处理算法：

在预处理部分, 我们用lps []计算值。为此, 我们跟踪上一个索引的最长前缀后缀值的长度(为此, 我们使用len变量)。我们将lps [0]和len初始化为0。如果pat [len]和pat [i]匹配, 我们将len加1, 并将增加的值分配给lps [i]。如果pat [i]和pat [len]不匹配且len不为0, 则将len更新为lps [len-1]。有关详细信息, 请参见下面的代码中的computeLPSArray()。

预处理图解(或构建lps [])

pat[] = "AAACAAAA"

len = 0, i  = 0.
lps[0] is always 0, we move 
to i = 1

len = 0, i  = 1.
Since pat[len] and pat[i] match, do len++, store it in lps[i] and do i++.
len = 1, lps[1] = 1, i = 2

len = 1, i  = 2.
Since pat[len] and pat[i] match, do len++, store it in lps[i] and do i++.
len = 2, lps[2] = 2, i = 3

len = 2, i  = 3.
Since pat[len] and pat[i] do not match, and len> 0, set len = lps[len-1] = lps[1] = 1

len = 1, i  = 3.
Since pat[len] and pat[i] do not match and len> 0, len = lps[len-1] = lps[0] = 0

len = 0, i  = 3.
Since pat[len] and pat[i] do not match and len = 0, Set lps[3] = 0 and i = 4.
We know that characters pat
len = 0, i  = 4.
Since pat[len] and pat[i] match, do len++, store it in lps[i] and do i++.
len = 1, lps[4] = 1, i = 5

len = 1, i  = 5.
Since pat[len] and pat[i] match, do len++, store it in lps[i] and do i++.
len = 2, lps[5] = 2, i = 6

len = 2, i  = 6.
Since pat[len] and pat[i] match, do len++, store it in lps[i] and do i++.
len = 3, lps[6] = 3, i = 7

len = 3, i  = 7.
Since pat[len] and pat[i] do not match and len> 0, set len = lps[len-1] = lps[2] = 2

len = 2, i  = 7.
Since pat[len] and pat[i] match, do len++, store it in lps[i] and do i++.
len = 3, lps[7] = 3, i = 8

We stop here as we have constructed the whole lps[].

如果发现任何不正确的地方, 或者想分享有关上述主题的更多信息, 请写评论。