Knuth–Morris–Pratt (KMP) algorithm

Knuth–Morris–Pratt (KMP) algorithm is one of the most elegant and efficient solutions to the pattern matching problem, and understanding it deeply will give you a solid foundation for both string algorithms and dynamic programming intuition.

🧩 The Problem KMP Solves

Given:

  • A text T of length n

  • A pattern P of length m

Example

T = "ABABDABACDABABCABAB"
P = "ABABCABAB"

We want to return the starting index of where the pattern appears in the text (in this case, index 10).

🪜 The Naive Way — and Why It’s Inefficient

The naive algorithm checks each position in the text one by one, comparing characters.

T: A B A B D A B A C D A B A B C A B A B
P: A B A B C A B A B
           ↑ mismatch here

Time Complexity:

Worst case → $O(n × m)$

KMP fixes this by not re-checking characters that it already knows will match.

⚙️ The Core Idea Behind KMP

When a mismatch happens, the KMP algorithm asks:

“Can I reuse part of what I already matched?”

It uses a prefix-suffix relationship inside the pattern to decide how far to skip.

To achieve this, KMP preprocesses the pattern to build an auxiliary array called LPS (Longest Prefix which is also a Suffix).

🧮 The LPS Array Explained

For each index i in the pattern P, $LPS[i]$ = the length of the longest proper prefix of $P[0…i]$ which is also a suffix of $P[0…i]$.

👉 A proper prefix means it doesn’t include the full substring itself.

Example:

Pattern P = "ABABCABAB"

The LPS of P would be

Index 0 1 2 3 4 5 6 7 8
Char A B A B C A B A B
LPS 0 0 1 2 0 1 2 3 4

🧩 How KMP Uses the LPS Array

During matching:

Start comparing pattern P with text T.

On mismatch:

Instead of starting from the beginning of the pattern again,

Jump to the index given by the LPS value of the last matched position because LPS tells longest prefix which is also the suffix. So before a mismatch happens, if there is any part of Prefix P, which matches with the suffix of matched substring of text T, we can directly skip to the start of the suffix which matches with the prefix of Pattern P and skip comparing intermediate characters.

Ex:

T = A B A B D A B A C D A B A B C A B A B
P = A B A B C A B A B

When mismatch happens at index 4 $(D ≠ C)$:

We look up LPS[4-1] = LPS[3] = 2.

So instead of starting over, we jump to index 2 in the pattern. That means: we already know the first two characters will match.

🧑‍💻 Python Implementation

def compute_lps(pattern):
    lps = [0] * len(pattern)
    length = 0  # length of the previous longest prefix suffix
    i = 1

    while i < len(pattern):
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        else:
            if length != 0:
                length = lps[length - 1]
            else:
                lps[i] = 0
                i += 1
    return lps


def kmp_search(text, pattern):
    n, m = len(text), len(pattern)
    lps = compute_lps(pattern)
    i = j = 0  # indexes for text and pattern

    while i < n:
        if text[i] == pattern[j]:
            i += 1
            j += 1
        if j == m:
            print(f"Pattern found at index {i - j}")
            j = lps[j - 1]  # Continue search for next match
        elif i < n and text[i] != pattern[j]:
            if j != 0:
                j = lps[j - 1]
            else:
                i += 1

📊 Complexity Analysis

Step Time Space
LPS Array Construction O(m) O(m)
Pattern Search O(n) O(1) extra
Total O(n + m) O(m)