Substring Problem

Problem Background

Given a long string source and a short string target, we want to find all the start indices of target’s occurrences in source.

The brute-force solution is to enumerate all the possible start indices of source and check whether it can match target.

def find_substring(source: str, target: str) -> List[int]:
    n, m = len(source), len(target)
    indices = []
    for i in range(n - m + 1):
        if source[i:i + m] == target:
            res.append(i)
    return indices

Certainly, the time complexity of this solution is \(O(n \cdot m)\), where \(n\) is the length of source and \(m\) is the length of target.

Rabin-Karp Algorithm (Rolling Hash)

The Rabin-Karp algorithm is a string-searching algorithm that uses hashing to find any one of a set of pattern strings in a text. It is a randomized algorithm, and it is often used with another algorithm to solve the string-matching problem.

The idea of the Rabin-Karp algorithm is to use a rolling hash to compare the hash value of target with the hash value of each substring of source with the same length as target. The rolling hash is a hash function where the input is hashed in a window that moves through the input.

String Hashing

The hash function that maps a string to a number is called a string hashing. The most common hash function is the polynomial rolling hash function, which is defined as follows:

\[H(s) = \sum_{i=0}^{n-1} s[i] \cdot b^{n-1-i} (\mod m)\]

where

  • \(s[i]\) is the ASCII value of the \(i\)-th character of the string s (notice: the value should starting from 1 to calculate the hash value)

  • \(b\): the base of the polynomial hash function, not necessarily a prime number, but larger than the maximum value of \(s[i]\) (for example, \(b = 256\) for ASCII characters, for lower-case English letters, \(b = 26\) is enough)

  • \(m\): the modulus of the polynomial hash function, which is usually a large prime number (at least larger than the biggest value of \(s[i]\), I usually use \(10^9 + 7\))

So here we give an example to calculate the hash of target:

# init
h_target = 0
for c in target:
    # increase the power, append new character and mod m
    h_target = (h_target * b + ord(c)) % m

Observation: the highest power of \(b\) in the polynomial hash function is \(b^{n-1}\), where \(n\) is the length of the string.

Update the Hash Value

We image a sliding window of length \(m\) moving from the left to the right of the string source. When the window moves from position \(i\) to position \(i+1\), we need to update the hash value of the substring source[i:i+m] to the hash value of the substring source[i+1:i+m+1]. Intuitively, we pop the first character of the substring and push the next character into the substring, and then update the hash value. So there is two steps to update the hash value:

  • Remove the source[i] from the hash value: subtract source[i] * b^{n-1} from the hash value

  • Add the source[i+m] to the hash value: multiply the hash value by b and add source[i+m]

def match(source: str, h_target: int, target_len: int) -> List[int]:
    res = []
    n = len(source)
    # highest power of b
    p = pow(b, target_len - 1, m)
    for i in range(n):
        if i >= target_len:
            # remove the first character
            h_source = (h_source - ord(source[i - target_len]) * p) % m
        # add the next character
        h_source = (h_source * b + ord(source[i])) % m
        if i >= target_len - 1 and h_source == h_target:
            res.append(i - target_len + 1)
    return res

Compared to other linear-time string-matching algorithms (e.g., KMP, AC automation, Z algorithm)

Pros:

  • Easy to understand, implement and explain the theory in it.

To be honest, I can’t memorize every theotical details and implementation of KMP. So I prefer rolling hash when I come across the string-matching problem in an interview.

Cons:

  • The hash value may collide: two different strings can have the same hash value. Be careful to choose the base and the modulus of the polynomial hash function.

  • You must know the length of the target in advance. It’s not suitable for the problem that the length of the target is not fixed.

Example

LC 686. Repeated String Match is a typical matching problem. Repeat the string a to at most len(a) + len(b) length and check the first occurrence of b in the repeated string.

class Solution:
    def repeatedStringMatch(self, a: str, b: str) -> int:
        base, M = 26, 10 ** 9 + 7
        target = 0
        n, m = len(a), len(b)
        for c in b:
            target = (target * base + ord(c) - ord('a') + 1 ) % M

        h = 0
        i = 0
        p = pow(base, m - 1, M)
        while i< n + m:
            if i >= m:
                last = a[(i - m) % n]
                h = (h - (ord(last) - ord('a') + 1) * p) % M
            h = (h * base + ord(a[i%n]) - ord('a') + 1) % M
            if h == target:
                # math.ceil((i+1) / n)
                return (i+n) // n
            i += 1
        return -1

Multi-length substrings example: 3504. Longest Palindrome After Substring Concatenation II. As there are too many substrings to memorize, I use large modulo (\(2^{61} - 1\)) to avoid hash collision.

M = (1<<61) - 1
base = 31

def build_hash_table(s):
    d = set()
    for i in range(len(s)):
        prefix = 0
        for j in range(i, len(s)):
            prefix = (prefix * base + ord(s[j]) - ord('a') + 1) % M
            d.add(prefix)
    return d


class Solution:
    def longestPalindrome(self, s: str, t: str) -> int:
        suffix = build_hash_table(t[::-1])
        prefix = build_hash_table(s)
        
        def check(s, suffix):
            p = []
            res = 0
            for i in range(len(s)):
                
                max_len = 0
                for j in range(i+1, len(s)+1):
                    if s[i:j] == s[i:j][::-1]:
                        max_len = max(max_len, j - i)
                res = max(res, max_len)
                for j in range(i):
                    if p[j] in suffix:
                        res = max(res, max_len + (len(p) - j) * 2)
                        break
                p = [(j * base + ord(s[i]) - ord('a') + 1) % M for j in p]
                p.append((ord(s[i]) - ord('a') + 1) % M)
            for i in range(len(s)):
                if p[i] in suffix:
                    res = max(res, (len(p) - i) * 2)
            return res
        
        return max(check(s, suffix), check(t[::-1], prefix))

Exercises:

LC2851. String Transformation: Use a substring matching alorithm (rolling hash or KMP) to determine which indices i can let s[i:] + s[:i] equal to t. (Hint: another point is fast exponentiation to quickly count the ways for state transition in a dynamic programming way.)

Kunth-Morris-Pratt (KMP) Algorithm

The KMP algorithm is another string-searching algorithm that uses a preprocessed table to avoid the backtracking of the brute-force solution.

The key idea of the KMP algorithm is to preprocess the target string to build a table next that records the length of the longest proper prefix of the substring target[:i] that is also a suffix of the substring target[:i].

Build the Next Table

The next table is built by the following rules:

  • next[0] = -1

  • For i > 0, if target[i] == target[j], then next[i] = j + 1

  • If target[i] != target[j], then we need to backtrack j to next[j] until we find a match or j becomes `-1