Knuth Morris Pratt [Python]
January 17, 2020
Tags: leetcode, strings, algorithmic question, search, python,
The idea behind this alogrithm is actually quiet simple. We just do the naive search more efficiently, we make sure that we dont repeat the work that we have already done.
Let me start with an example, we are given a string abababc
and we want to check if the pattern ababc
exists in the given string. When we start matching from the beginning:
abab
abc
abab
c
Now the character a
at index 4 in the string does not match the character c
at index 4 of the pattern. In the naive algorithm we would start the whole character matching process again from index 1 of the string (we would be matching ababab
c with ababc
) , but if we look at the string and the pattern do we really have to do that? We can resume the matching from the following positions:
abab
abc (matching resumes at index 4)
ab
abc (matching resumes at index 2)
We can do this because we successfully matched abab
which means there are two ab
’s in both the pattern and the string. Now we know that the last ab
in the string matches with the first ab
of the pattern and resume the matching after that.
So basically what we do is create a table which tells us where to resume our matching from in the pattern if there was a mismatch while matching the pattern and the string.
From a different perspective all we are doing is finding suffixes that are prefixes in the pattern so that if there is a mismatch after the suffix we can resume the search from the end of prefix. In the example above we there was a mismatch after we matched abab
. Here ab
is suffix and prefix so we can move back to index 2 in pattern which is just after the prefix ab
and resume our matching process
Read more on Wikipedia, they have a pretty good explaination and example.
Now lets code the table generation
def kmp_table(W):
n = len(W)
T = [0 for _ in range(n)]
T[0] = -1
pos = 0
for i in range(1, n):
if W[i] == W[pos]:
T[i] = T[pos]
else:
T[i] = pos
pos = T[pos]
while pos >= 0 and W[i] != W[pos]:
pos = T[pos]
pos += 1
return T
For building the table the idea is to have two different pointers starting at index 0 and 1, we can think of as one pointer for the prefix and another for the suffix. Basically if the characters at prefix and suffix pointers match we increment the pointers otherwise we keep changing the prefix pointer till it matches the suffix pointer or till it reaches the beginning of the pattern with the help on the table we have been building.
Now that we have our table we can start our check and use this table to get the index to resume the check
def KMP(string, substring):
m = len(substring)
n = len(string)
T = kmp_table(substring)
pos = 0
i = 0
while i < n:
if string[i] != substring[pos]:
pos = T[pos]
if pos < 0:
pos += 1
i += 1
else:
i += 1
pos += 1
if pos == m:
return True
return False