Substring Matching Solution 2
This is a solution to a problem discussed at the 2025 research workshop. The goal is to prove that contains as substrings.
We present a scheme for substring matching which uses gates, where is the complexity of a hash function on inputs (which formally we assume to be a PRF).
Problem statement / notation
We have input strings of length and of length (), and want to write an arithmetic circuit (with nondeterministic inputs) to prove that all the are substrings of .
The Merkle tree algorithm can do this with complexity .
Algorithm
For each , let be the index at which appears in . We provide these as witnesses to the circuit in binary decomposition.
We pick a "random" . Define the sets: We can compute the elements of these sets in time (computing using binary exponentiation). It is enough for us to prove that for all (this is sound with failure probability as long as is a PRF).
Now define the polynomials We now wish to show that (as polynomials) for all .
Polynomial divisibility protocol
We'll now show a protocol to prove polynomial divisibility; say we wish to show for some polynomials of degree . Let (as polynomials); we let the coefficients of (as elements of ) be public inputs. Let be a hash of all the inputs (including ). Then, we simply check that , using multiplications. By degree counting, as long as is a PRF, this is sound with probability .
Note that this protocol assumes no bounds on the coefficients of ; it works entirely in .
Remainder of the algorithm
The complexity of the polynomial divisibility protocol was the degree of the larger polynomial, so if we apply it naively, we would have complexity , which is too much.
Instead, create a binary tree on all the , and label each node with the least common multiple (as polynomials) of its descendants. For example, the node above will be labeled with the polynomial . The prover can compute the polynomials at each node and then provide all their coefficients as witnesses. Finally, for the proof, we execute a divisibility proof for every node and its children, and a divisibility proof that the top node divides . By soundness of the divisibility proofs, a valid proof necessarily implies that all of the divide . The total degree of all polynomials in the binary tree is at most , and the degree of is , so the total complexity of all the divisibility checks is , as desired.
Note that we can actually use the same value of for all of these divisibility checks (hashing all the inputs together), so we only take one hash for all this time. There are also probably ways to save constant factors by not actually computing the lcms.