Rabin-Karp string matching
Rabin-Karp is a string matching algorithm that utilizes a rolling hash. In the average case, it will run in linear time. The worst case runtime is quadratic, and occurs when the choice of hashing function combined with the inputs results in many hash collisions.
The Rabin-Karp algorithm is particularly useful when searching for multiple same-length patterns in one text.
Average Time Complexity: O(N + M), where N is the length of the text and M is the length of the pattern
Worst Case Time Complexity: O(N * M + M)
Space Complexity: O(1)
Code
Given:
- pattern: str
- text: str
if len(pattern) == 0:
return 0
if len(pattern) > len(text):
return -1
m = len(pattern)
n = len(text)
base = 256
mod = 10**9 + 7
power = (base**(m - 1)) % mod
# Initialize hashes
pattern_hash = 0
text_hash = 0
for i in range(m):
pattern_hash = (base * pattern_hash + ord(pattern[i])) % mod
text_hash = (base * text_hash + ord(text[i])) % mod
for i in range(n - m + 1):
if pattern_hash == text_hash:
# If the hashes match, verify that the strings match
found = True
for j in range(m):
if text[i + j] != pattern[j]:
found = False
break
if found:
return i
# Remove left character and add right character to hash
if i < n - m:
text_hash = (base * (text_hash - ord(text[i]) * power) + ord(text[i + m])) % mod
if text_hash < 0:
text_hash += mod
return -1
Explanation
In brute force string matching, we try to match each character in the pattern starting from each index in the text. This results in a quadratic runtime.
First iteration:
text | A | A | A | A | A | B |
pattern | A | A | B |
Second iteration:
text | A | A | A | A | A | B |
pattern | A | A | B |
Third iteration:
text | A | A | A | A | A | B |
pattern | A | A | B |
Fourth iteration:
text | A | A | A | A | A | B |
pattern | A | A | B |
We can slide the pattern across the text (length N) in O(N) time. But at each index, we have to check the pattern (length M) against the text window in O(M) time. This results in a quadratic runtime of O(N * M).
A better approach
If we could compare a window of text with the pattern in constant time, we could reduce our runtime to O(N). Of course, we can't compare two strings in constant time. However, we can hash the strings and compare the two hash values (which are just integers) in constant time.
Hashing a string usually takes linear time. In this case, we are only deleting one character and adding one character to the string in each iteration. Thus, we can use a rolling hash function to incrementally change the hash in constant time.
1st iteration | A | A | A | A | A | B |
2nd iteration | A | A | A | A | A | B |
3rd iteration | A | A | A | A | A | B |
4th iteration | A | A | A | A | A | B |
Rolling hash
The hash function requires two integers mod and base . mod is a large prime number and base is typically the size of the character set. power is calculated using those two values along with the length of pattern .
base = 256
mod = 10**9 + 7
power = (base**(m - 1)) % mod
To add a new character to the hash, we will do the following:
hash = (base * hash + char_val) % mod
When we slide our window through the text, we are adding characters to the right side and removing them from the left side. That means by the time a character is being removed, m - 1 other characters have been added, and the hash has been multiplied by base ^ (m - 1) .
To undo this and remove the character, we do:
hash = hash - char_val * power
In the code the addition of the new character and removal of the old are combined into one line. Removing a character can result in a negative value, in which case mod is added to make it positive.
text_hash = (base * (text_hash - ord(text[i]) * power) + ord(text[i + m])) % mod
if text_hash < 0:
text_hash += mod
Hash collisions
Two different strings can hash to the same value. This means that if the text_hash and pattern_hash are equal, we still have to check each character to see if the strings match. If the hash values aren't equal, we know for certain that the strings don't match.
Certain inputs paired with certain hash functions can result in a lot of collisions. With a very high collision frequency, the runtime of the algorithm will approach its worst-case runtime of O(N * M). While this is impossible to combat fully without a perfect hash function, with a good hash function and large enough mod value, the average runtime will be O(N).
Multiple pattern matching
Rabin-Karp is particularly useful for searching a text for multiple different patterns of the same length. Simply hash all of the patterns and put them in a set. Then, you can check if the text_hash is in the pattern set in constant time.
Excluding preprocessing, the runtime of this algorithm does not scale with the number of patterns. So no matter if you are trying to match one pattern or many, the runtime is average O(N) and worst case O(N * M). The preprocessing takes O(K * M) where k is the number of patterns and M is the length of the patterns.