Brute force matching

 

This algorithm moves forward only by one character when a mismatch is found, comparing all of the text string to all of the pattern on every try.

 

i=1, j=1, prev = i

repeat

if text[i] == pattern[j]  then    // Move ahead one character in both

i++, j++

else               // Mismatch

prev++;       // Reset the text start position

i:= prev;          // Restart the text counter

j:=1          // Reset the pattern counter

until ( j > M ) or ( i > N );

KMP algorithm

This algorithm never backs up in the text.  When a mismatch occurs, it resets the pattern counter to a value stored in the auxiliary array next[], which is initialized by processing the pattern against itself.  That is, it does back up in the pattern array, but not necessarily to the beginning.

i=1, j=1

repeat

if j==0 or text[i] == pattern[j] then // Increment both counters

i++, j++

else                         // Mismatch

j:= next[j];            // Reset the pattern counter

until ( j > M ) or ( i > N );

Computing next[]

Note the similarity of this algorithm to KMP itself.  As the pattern is matched against itself, the next[] array is also being filled in.

i:=1; j:=0;

next[1]:= 0;

repeat

if  j==0  or  pattern[i] == pattern[j] then

i++, j++;

next[i] = j

else

j:= next[j];

until i >= M;


Rabin-Karp String Matching

 

Consider characters as digits of a base-d integer (use d >= size of alphabet).  The pattern's value is

value = pattern[1]*dm-1 + pattern[2] * dm-2 + … + pattern[m] * d0

Compute the fingerprint of the pattern := hash(value) mod q, where q is a large prime and the fingerprint of the current part of the text.  If they match, the pattern and current part of the text must still be compared character by character to rule out a "false positive" - that is, a collision of the hash values.  The fingerprint of the pattern is computed once.  The fingerprint of the current part of the text can be computed incrementally (ignoring mod q for now):

xi          = text[i]*dm-1      + text[i+1]* dm-2 + … + text[i+m-1]*d0

xi+1       = text[i+1]* dm-1 + text[i+2]*dm-2 + … + text[i+m]*d0

Multiply xi by d, subtract xi+1, and solve for xi+1 to get

                        xi+1 = d*xi - text[i]*dm + t[i+m]

 

since the middle m-2 characters are the same in both parts - i.e., the new value is based on the old value, minus dm times the character leaving the current part, plus the character entering into the current part.  However, note that this expression contains dm, which can overflow an integer variable; in addition, the original computation of the fingerprints of the pattern and the text can cause overflow.  The following code uses properties of the mod operator to remove the overflow problem.

 

value = 0; // pattern

xi    = 0; // text

 

a = 1;  for (i=1; i<M; i++) { a = d*a % q; } // Compute helper value

 

// initial values of pattern and text

for (i=0; i<M; i++) {

   value = (value*d + pattern[i]) % q;

   xi    = (xi*d    + text[i])    % q;

}

 

// Main loop - check hash values

while not done {

   if ( value == xi ) { // check for false positive, otherwise a match }

   else {

      i++;

      // New value of xi in two steps

      xi = (xi + d*q - text[i]*a) % q;

      xi = (xi*d + text[i+M]) % q;

      // if xi < 0, increment by d*q until it's positive

   }

}

 

Example  (using brute-force calculations)    <space> = 0, A = 1, B = 2, …, Z = 26, d = 27, q = 101

pattern: DOG = 4 * 272 + 15 * 27 + 7 = 3326, mod 101 = 96

text: CAT<space>DOG

                        current text: CAT = 3*272 + 1 * 27 + 20 = 2234, mod 101 = 12

                        current text: AT<space> = 1 * 272 + 20 * 27 + 0 = 1296, mod 101 = 84

                        and so on….