Knuth Morris Pratt [Python]

January 17, 2020
Tags: leetcode, strings, algorithmic question, search, python,

The idea behind this alogrithm is actually quiet simple. We just do the naive search more efficiently, we make sure that we dont repeat the work that we have already done.

Let me start with an example, we are given a string abababc and we want to check if the pattern ababc exists in the given string. When we start matching from the beginning:

abababc

ababc

Now the character a at index 4 in the string does not match the character c at index 4 of the pattern. In the naive algorithm we would start the whole character matching process again from index 1 of the string (we would be matching abababc with ababc) , but if we look at the string and the pattern do we really have to do that? We can resume the matching from the following positions:

abababc (matching resumes at index 4)

ababc (matching resumes at index 2)

We can do this because we successfully matched abab which means there are two ab’s in both the pattern and the string. Now we know that the last ab in the string matches with the first ab of the pattern and resume the matching after that.

So basically what we do is create a table which tells us where to resume our matching from in the pattern if there was a mismatch while matching the pattern and the string.

From a different perspective all we are doing is finding suffixes that are prefixes in the pattern so that if there is a mismatch after the suffix we can resume the search from the end of prefix. In the example above we there was a mismatch after we matched abab. Here ab is suffix and prefix so we can move back to index 2 in pattern which is just after the prefix ab and resume our matching process

Read more on Wikipedia, they have a pretty good explaination and example.

Now lets code the table generation

def kmp_table(W):
	n = len(W)
	T = [0 for _ in range(n)]
	T[0] = -1
	pos = 0
	
	for i in range(1, n):
		if W[i] == W[pos]:
			T[i] = T[pos]
		else:
			T[i]  = pos
			pos = T[pos]
			while pos >= 0 and W[i] != W[pos]:
				pos = T[pos]
		pos += 1
	
	return T

For building the table the idea is to have two different pointers starting at index 0 and 1, we can think of as one pointer for the prefix and another for the suffix. Basically if the characters at prefix and suffix pointers match we increment the pointers otherwise we keep changing the prefix pointer till it matches the suffix pointer or till it reaches the beginning of the pattern with the help on the table we have been building.

Now that we have our table we can start our check and use this table to get the index to resume the check

def KMP(string, substring):
	m = len(substring)
	n = len(string)
    T = kmp_table(substring)
	
	pos = 0
	i = 0
	
	while i < n:
		if string[i] != substring[pos]:
			pos = T[pos]
			if pos < 0:
				pos += 1
				i += 1
		else:
			i += 1
			pos += 1
			
			if pos == m:
				return True
	return False