Knuth-Morris-Pratt (KMP) string matching
KMP is a string matching algorithm that runs in linear time. When the pattern and text have many repeated substrings, KMP is vastly more efficient than the naive string matching algorithm.
Time Complexity: O(N + M), where N is the length of the text and M is the length of the pattern
Space Complexity: O(M)
Code
Given:
- pattern: str
- text: str
if len(pattern) == 0:
return 0
# Preprocess the pattern to find the longest prefix suffix array
lps = [0] * len(pattern)
prev = 0
for i in range(1, len(pattern)):
while pattern[i] != pattern[prev] and prev > 0:
prev = lps[prev - 1]
if pattern[i] == pattern[prev]:
lps[i] = prev + 1
prev += 1
# Match the pattern against the text
prev = 0
for i in range(len(text)):
while text[i] != pattern[prev] and prev > 0:
prev = lps[prev - 1]
if text[i] == pattern[prev]:
prev += 1
if prev == len(pattern):
return i + 1 - len(pattern)
return -1
Explanation
Naive string matching is inefficient because it checks the entire pattern starting from every index of the text. This means that it will require N * M comparisons in the worst case.
We can see this with the following example:
text | A | A | A | A | A | A | A | A |
pattern | A | A | A | A | B |
Naive string matching will try to match the pattern to the substring of text starting at index 0. It will match the first 4 As from pattern, and then hit a mismatch.
text | A | A | A | A | A | A | A | A |
pattern | A | A | A | A | B |
Now it will advance 1 index. It will again match the first 4 As from pattern, and then hit another mismatch.
text | A | A | A | A | A | A | A | A |
pattern | A | A | A | A | B |
It will continue to try to match in this manner for each starting index in the text, checking the first 4 characters of the pattern and then hitting a mismatch at the 5th.
Notice that the only new character we are checking in the second iteration is the last one. We already saw the preceding characters in the first iteration.
Why then, do we have to restart the pattern matching from the beginning when we hit a mismatch? On our second attempt matching, we already know that 4 of the previous characters match the start of the pattern.
In other words, the suffix from the first matching attempt AAAAA is also a prefix to the pattern AAAAB.
text | A | A | A | A | A | A | A | A |
pattern | A | A | A | A | B |
The idea behind the Knuth-Morris-Pratt algorithm is that when a mismatch occurs, we don't have to restart the pattern matching from the beginning. We can identify cases like the one above, where a suffix of the previous match is also a prefix of the pattern. This is why we create a Longest Prefix Suffix array.
LPS array
Here are some example patterns and their corresponding LPS arrays.
pattern = "ABCD"
lps = [0, 0, 0, 0]
pattern = "ABCABZ"
lps = [0, 0, 0, 1, 2, 0]
pattern = "AAAAB"
lps = [0, 1, 2, 3, 0]
pattern = "AAABAAAA"
lps = [0, 1, 2, 0, 1, 2, 3, 3]
Take this example:
pattern | A | B | C | A | B | Z |
lps | 0 | 0 | 0 | 1 | 2 | 0 |
LPS[3] = 1 because the longest prefix suffix ending at index 3 is ABCABZ (length = 1).
pattern | A | B | C | A | B | Z |
lps | 0 | 0 | 0 | 1 | 2 | 0 |
LPS[4] = 2 because the longest prefix suffix ending at index 4 is ABCABZ (length = 2).
pattern | A | B | C | A | B | Z |
lps | 0 | 0 | 0 | 1 | 2 | 0 |
Matching process
Now let's see how this LPS array helps us in the string matching process. In the example below, we can match starting from the beginning.
text | A | B | C | A | B | C | A | B | Z |
pattern | A | B | C | A | B | Z | |||
lps | 0 | 0 | 0 | 1 | 2 | 0 |
After 5 characters, we hit a mismatch. In the naive algorithm, we would advance the pattern by 1. But our LPS array tells us that even though there is a mismatch, the first 2 characters of our pattern ('AB') are a suffix to our current matching attempt.
Which means that we can jump back to the second index of pattern and continue our matching attempt.
text | A | B | C | A | B | C | A | B | Z |
pattern | A | B | C | A | B | Z | |||
lps | 0 | 0 | 0 | 1 | 2 | 0 |
It should be obvious that this is much more efficient than the naive solution.