Knuth-Morris-Pratt in Haskell

Published: 2007-04-15T22:00Z
Tags: haskell
License: CC-BY

A request that comes up regularly on the Haskell mailing list is for a function to determine whether one string (the needle) is a substring of another one (the haystack). While there is no such function in the Haskell standard library, it is easy enough to implement:

import Data.List
as `isSubstringOf` bs = any (as `isPrefixOf`) (tails bs)

Unfortunatly, this function has a worst case time complexity of O(length as * length bs). For example if we evaluate

"aaaaaaaaaab" `isSubstringOf` replicate 100 'a'

We will first match 10 characters starting from the first position and fail just before we matched the entire string. Then, starting from the second position, we will match 10 characters again, etc. In total we we will do 11 * 100 = O(length as * length bs) comparisons.

There exists an algorithm called the Knuth-Morris-Pratt string searching algorithm which has a much better, O(length as + length bs), worst case behavior. Unfortunately all descriptions you find of the algorithm rely on building a table, and using random access patterns on it. Not only does this make it impossible to use simple data structures like lists, it also obfuscates the underlying idea.

The idea

The core idea of the algorithm is that we only want to process each character of both strings once. This is done by building a table from the needle, and using that table to determine what should be done after each character of the haystack. Either the entire needle has been matched at that point and we are done, or we get a new position in the table to use for the next character.

So, let's turn the above description into a Haskell datatype!

data KMP a = KMP
      { done :: Bool
      , next :: (a -> KMP a)
      }

Clearly, if we know how to make such a 'table' the matching process is straight forward. We need to apply next to each character and we want to know if any of the intermediate tables are done:

isSubstringOf2 :: Eq a => [a] -> [a] -> Bool
isSubstringOf2 as bs = match (makeTable as) bs
   where  match table []     = done table
          match table (b:bs) = done table || match (next table b) bs

This can be made shorter using functions from the Prelude:

isSubstringOf3 as bs = any done $ scanl next (makeTable as) bs

Making the table

All that is left is to make a table, constructing it using a simple recursive function is not an option

makeTable1 :: Eq a => [a] -> KMP a
makeTable1 []     = KMP True  undefined?
makeTable1 (x:xs) = KMP False (\c -> if c == x then makeTable1 xs else ????)

Because what do we do if we don't have a match? Let's look at an example, the calculation "abc" `isSubstringOf` "aabc" would go something like:

makeTable "abc" = table0
done table0 = False
next table0 'a' = (\c -> if c == 'a' then table1 else ????) 'a' = table1
done table1 = False
next table1 'a' = (\c -> if c == 'b' then table2 else ????) 'a'
                = ???? -- what to do now?

What we should do, is start over, but dropping the first character from the input, in this case that gives

-- start over, now for "abc" `isSubstringOf` "abc"
next table0 'a' = (\c -> if c == 'a' then table1 else ????) 'a' = table1
done table1 = False
next table1 'b' = (\c -> if c == 'b' then table2 else ????) 'b' = table2
done table2 = False
next table2 'b' = (\c -> if c == 'c' then table3 else ????) 'c' = table3
done table3 = True

The trick

At first glance it would seem that we have to reexamine parts of the haystack when we start over. But this is not the case.

If, for example the test of table35 fails, we don't have to move back 35 characters, because we already know what those characters are, namely the characters we matched to get to table35! So the table in case of a failed match is always the same, and we can compute that as well.

Lets look again at the makeTable function. If f is the table we get for a failed match, we call next f the failure function, and pass it along as a second parameter. For the first character, in case of a failed match we simply and start from the beginning for the next character:

makeTable :: Eq a => [a] -> KMP a
makeTable xs = table
   where table = makeTable' xs (const table)

Notice we have tied the knot, table depends on table itself! In Haskell this is not a problem because of lazy evaluation, as long as we don't try to use what is not computed yet.

The makeTable' function is where the real work happens.

makeTable' []     failure = KMP True failure
makeTable' (x:xs) failure = KMP False test
   where  test  c = if c == x then success else failure c
          success = makeTable' xs (next (failure x))

The base case is not very interesting, although we can now use something better than undefined. That becomes useful when looking for multiple matches.

The interesting clause is for (x:xs). The next function compares a character c against x.
Is it the same? Great, move to the table for xs.
Is it different? Then look at the failure function.

Finally, to determine the table for xs, we need a new failure function, describing what would have happened if we started later and ended up at the position after x. We can ask the current failure function what would have happened in that case, next (failure x).

Correctness

It would be nice if we could be sure that what we have constructed is actually a substring matching algorithm. The easiest way to verify that I use a simple QuickCheck property:

prop_isSubstringOf :: [Bool] -> [Bool] -> Bool
prop_isSubstringOf as bs = (as `isSubstringOf` bs) == (as `isSubstringOf2` bs)
> Test.QuickCheck.test prop_isSubstringOf
OK, passed 100 tests.

It seems to work, that's great.

An interesting exercise would be to prove that what I have made here is equivalent to the naïve algorithm using equational reasoning. Also nice would be comparing it to the imperative Knutt-Moris-Pratt algorithm, is this actually KMP? Maybe next time.

footnotes
Actually, this function was recently added under the, in my opnion, wrong name isInfixOf.
It is wrong because while "a is a prexif of b" and "a is a suffix of b" are valid English sentences, there is as far as I know no such thing as "an infix of". Maybe "infix in", but not "of". </rant>

Comments

Another(?) formulation of KMP in a functional style can be found here: http://web.comlab.ox.ac.uk/oucl/work/jeremy.gibbons/publications/index.html#kmp
Richard S. Bird, Jeremy Gibbons and Geraint Jones (1989). Formal Derivation of a Pattern Matching Algorithm. Science of Computer Programming, 12(2):93-104. Or in my PhD thesis, but that is in old-fashioned squiggol, so I'm pretty sure it is unreadable.

Justinx

Excellent article. Haskell works so well for describing how an algorithm works. Thanks for posting!

Oleg and I have written another implementation of KMP search that uses Haskell's type system to guarantee statically that all string and array accesses are safe: http://okmij.org/ftp/Computation/lightweight-dependent-typing.html#Accompanying

Thomas Hartmanx

Knuth Morris Pratt algo doesn't seem to give any better performance than "naive" algo from Data.List

There exists an algorithm called the Knuth-Morris-Pratt string searching algorithm....

I did some tests comparing KMP with isInfixOf from Data.List (which is the same as the "naive" substring function you describe first) but there isn't any noticeable difference in performance.

I guess this means either the naive algo gives performance as good as KMP, or your optimized algo isn't quite right. Or I could be missing something.

import Data.List
-- eg t isSubstringOf1 7 should give [False,False,False,False,False,False,False] t f k = map f [1..k] where f j = isInfixOf (9 : [1..(10^j)]) [1..(10^k)]
-- from Data.List, isInfixOf -- , same as "naive" algo from http://twanvl.nl/blog/haskell/Knuth-Morris-Pratt-in-Haskell needle `isSubstringOf1` haystack = any (isPrefixOf needle) (tails haystack)
-- According to http://twanvl.nl/blog/haskell/Knuth-Morris-Pratt-in-Haskell -- this algo should give better performance than the "naive" one but my tests show no -- difference in running time. Either --- the naive algo gives performance as good as KMP, -- or the optimized algo isn't quite right. -- Or I could be missing something. isSubstringOf2 as bs = any done $ scanl next (makeTable as) bs
makeTable :: Eq a => [a] -> KMP a makeTable xs = table where table = makeTable' xs (const table) makeTable' [] failure = KMP True failure makeTable' (x:xs) failure = KMP False test where test c = if c == x then success else failure c success = makeTable' xs (next (failure x))
data KMP a = KMP { done :: Bool , next :: (a -> KMP a) }
Thomas Hartmanx

oh brother. here's a link to the source code sorry about blog formatting cockup... here's the source http://code.google.com/p/thomashartman-learning/source/browse/trunk/haskell/katas/euler/isInfixOf.hs

Spammerx

nice captcha solution ....

csorozDate: 2011-04-27T12:45Zx

I suppose Thomas Hartman's test function should be:

t f k = map g [1..k]
        where g i = f (9 : [1..(10^i)]) [1..(10^k)]

but with this change isSubstringOf2 is even worse than isSubstringOf1 (inside GHCi).

Daniel FischerDate: 2011-09-07T00:42Zx

Thomas, your example is nearly optimal for the naive algorithm. Both algorithms just compare 9 to each element of haystack once, and once the 1 to 10 in the haystack. The compiler might be able to optimise this KMP implementation enough to match the performance of the simple algorithm, but not necessarily. For KMP to have a significant impact, you must have many partial matches, so that the naive algorithm would compare many items in the haystack multiple times, while KMP would need to compare far fewer items multiple times. You can make KMP's advantage as large as you please with appropriate inputs ((replicate k 'a' ++ "b") `isInfixOf` replicate n 'a'), but for 'generic' inputs, there will be little difference (because there will be few partial matches beyond the first item). For text searching, KMP will do a bit better in comparison, because there are some rather common letter-groups ("theoretically" will produce a lot of partial matches of length two and three in most English texts, for example). Normally, however, partial matches are not too frequent and short, so the difference will remain small. The important point of KMP is that it bounds the worst-case behaviour. If you want something that performs significantly better than the naive algorithm in the generic case, you need an algorithm that doesn't inspect all items in the haystack, like the Boyer-Moore algorithm (excellent for long needles over a smallish - but not too small - alphabet; but needs random access, so not appropriate for Haskell lists, except on types where comparison is very expensive).

Daniel FischerDate: 2011-09-07T00:44Zx

Gah! Formatting failure. Sorry.

Chao XuDate: 2014-04-11T07:20Zx

I feel that this is actually MP instead of KMP algorithm? MP : http://www-igm.univ-mlv.fr/~lecroq/string/node7.html KMP: http://www-igm.univ-mlv.fr/~lecroq/string/node8.html

Reply

(optional)
(optional, will not be revealed)
Name a function of type [[a]] -> [a]:
Use > code for code blocks, @code@ for inline code. Some html is also allowed.