Brute force matching
This algorithm moves forward only by one
character when a mismatch is found, comparing all of the text string to all of
the pattern on every try.
i=1,
j=1, prev = i
repeat
if text[i] == pattern[j] then // Move ahead one
character in both
i++, j++
else // Mismatch
prev++; // Reset the text start position
i:= prev; // Restart the text counter
j:=1 // Reset the pattern counter
until
( j > M ) or ( i > N );
KMP algorithm
This algorithm never backs up in the text. When a mismatch occurs, it resets the
pattern counter to a value stored in the auxiliary array next[], which is
initialized by processing the pattern against itself. That is, it does back up in the pattern array, but not necessarily
to the beginning.
i=1,
j=1
repeat
if j==0 or text[i] == pattern[j] then // Increment both counters
i++, j++
else // Mismatch
j:= next[j]; // Reset the pattern counter
until
( j > M ) or ( i > N );
Computing
next[]
Note the similarity of this algorithm to KMP
itself. As the pattern is matched
against itself, the next[] array is also being filled in.
i:=1;
j:=0;
next[1]:=
0;
repeat
if j==0
or pattern[i] == pattern[j] then
i++, j++;
next[i] = j
else
j:= next[j];
until
i >= M;
Rabin-Karp String Matching
Consider characters as digits of a base-d integer (use d >= size of alphabet). The pattern's value is
value =
pattern[1]*dm-1 + pattern[2] * dm-2 +
+ pattern[m] * d0
Compute the fingerprint of the pattern := hash(value) mod q, where q is a large prime and the fingerprint of the current part of the text. If they match, the pattern and current part of the text must still be compared character by character to rule out a "false positive" - that is, a collision of the hash values. The fingerprint of the pattern is computed once. The fingerprint of the current part of the text can be computed incrementally (ignoring mod q for now):
xi = text[i]*dm-1 + text[i+1]* dm-2 +
+
text[i+m-1]*d0
xi+1 = text[i+1]* dm-1 + text[i+2]*dm-2 + + text[i+m]*d0
Multiply xi by d, subtract xi+1, and solve for xi+1 to get
xi+1
= d*xi - text[i]*dm + t[i+m]
since the middle m-2 characters are the same in both parts - i.e., the new value is based on the old value, minus dm times the character leaving the current part, plus the character entering into the current part. However, note that this expression contains dm, which can overflow an integer variable; in addition, the original computation of the fingerprints of the pattern and the text can cause overflow. The following code uses properties of the mod operator to remove the overflow problem.
value
= 0; // pattern
xi = 0; // text
a
= 1; for (i=1; i<M; i++) { a = d*a %
q; } // Compute helper value
//
initial values of pattern and text
for
(i=0; i<M; i++) {
value = (value*d + pattern[i]) % q;
xi
= (xi*d + text[i]) % q;
}
//
Main loop - check hash values
while
not done {
if ( value == xi ) { // check for false
positive, otherwise a match }
else {
i++;
// New value of xi in two steps
xi = (xi + d*q - text[i]*a) % q;
xi = (xi*d + text[i+M]) % q;
// if xi < 0, increment by d*q until
it's positive
}
}
Example (using brute-force calculations) <space> = 0, A = 1, B = 2, , Z = 26, d = 27, q = 101
pattern: DOG = 4 * 272 + 15 * 27 + 7 = 3326, mod 101 = 96
text: CAT<space>DOG
current text: CAT = 3*272 + 1 * 27 + 20 = 2234, mod 101 = 12
current text: AT<space> = 1 * 272 + 20 * 27 + 0 = 1296, mod 101 = 84
and so on .