Thursday, June 25, 2015

The percent of valid formulae in random strings

In the solution to the Arithmetic hide and seek problem, I mentioned a practical observation that among all strings composed of 18 characters "0123456789+-*/^k()" roughly 20% are valid formulas, and conjectured that it is true for arbitrarily long strings. This appears to be wrong: the percent gradually converges to 0.


Let, for some formal language, \(n_k\) is the number of words of length k. Then Z-transform of the sequence \(n_0,n_1,...\) is called its growth function \(f(z)\). (Apparently, it is common to use non-standard Z-transform by positive powers of z for growth function; I instead use the usual negative transform). For a regular language, that is set of strings, accepted by some finite automaton, growth functions are always rational and can be calculated by a simple formula:
$$f(z) = e(zI - A)^{-1}s,$$ where A is transition matrix, showing number of ways to get from state i to state j; e is row matrix of accepting states (1 for accepted 0 otherwise); and s is column matrix of initial states.

Unfortunately, formulas are not a regular language, because accepting automaton must have infinite number of states to balance braces. So instead of the exact count, I would estimate it by upper and lower bounds.

Lower bound

Lower bound is given by the automaton, that accepts valid formulas without braces.

Automaton, accepting formulas without braces. Blue are accepting states.

To better match grammar in my solution, it disallows leading zeros in integers. States are named by the last accepted character; blue color indicates accepting states. State "OP" is achieved after consuming a binary operator. If 5 states are ordered as [OP,0,1-9,0-9,K] then the transition matrix is:
$$A = \left[\begin{matrix} 1 & 5 & 5 & 5 & 5\\ 1 & 0 & 0 & 0 & 0 \\ 9 & 0 & 0 & 0 & 0\\ 0 & 0 & 10 & 10 & 0\\ 1 & 0 & 0 & 0 & 0 \end{matrix}\right]$$ which, with \(e=(0,1,1,1,1)\) and \(s^T=(1,0,0,0,0)\) gives growth function:
$$f_{lower}(z) = \frac{11z^2-20z}{z^3-11z^2-45z+100}.$$ After inverse transform, this gives an expression for number of valid strings of length k:
$$n_{lower}(k) = 0.623\cdot(-4.399)^k+0.02467\cdot{1.654}^k+0.598\cdot{13.745}^k. $$

Upper bound

To get upper bound, consider an automaton, accepting strings with unbalanced braces, only requiring that operator is not following an opening brace, and number not following a closing brace. Such strings are clearly a superset of all valid formulas.

Automaton, accepting formulas with unbalanced braces. Blue are accepting states, "op" is one of 5 binary operators "+-*/^".

Its transition matrix, for state ordering (OP, 0, 1-9, 0-9, ")", K) is:
$$A = \left[ \begin{matrix} 2 & 5 & 5 & 5 & 5 & 5\\ 1 & 0 & 0 & 0 & 0 & 0\\ 9 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 10 & 10 & 0 & 0\\ 0 & 1 & 1 & 1 & 1 & 1\\ 1 & 0 & 0 & 0 & 0 & 0 \end{matrix} \right]$$ which gives growth function:
$$f_{upper}(z) = \frac{11z-20}{z^3-13z^2-23z+80}$$ and the number of accepted strings is:
$$n_{upper}(k) = -0.6339*(-3.06)^k-0.003648 \cdot (1.8383)^k+0.6375 \cdot {14.2217}^k $$


Thus, the number of correct formulas grow not slower than \({13.745}^k\) and not faster than \({14.2217}^k\) where k is string length. Since total number of strings is \(18^k\), the percent of valid formulas is gradually decreasing with rate between \({0.7636}^k\) and \({0.7901}^k\). Moreover, \(k \approx \log_{18}t\), where t is current string index in shortlex ordering. Using this, asymptotic percentage bounds can be written as: \({0.7636}^{log_{18}t} = t^{-0.0933}\) and \(t^{-0.0815}\). Slow decrease rate explains why it was not apparent in the numeric test results.

Despite that my initial conjecture on percentage of valid formulas failed, the conclusion on growth rate is nevertheless correct, since number of formulas grows exponentially with length.

Note on Z-transform

Articles I have seen define growth function by means of non-standard Z-transform by positive powers of z:

$$f(z) = \sum_{i=0}^\infty a_i z^i.$$

Wikipedia calls it "Geophysical definition". I here took a regular definition applied in control systems and signal processing, since it allows me to use standard computer algebra tools. Its another advantage is that bases of the exponentials in expression for string count are just negated roots of the denominator polynomial. Traditional growth functions may be obtained by substituting \(z'=1/z\).