Sunday, April 1, 2012

Sampling the strings of a regular language

This week Adam focused on the generation of strings which match a specific regular expression. Several ways of approaching this has problem have been described in the literature. For example, two methods are given in the functional pearl Enumerating the strings of regular languages by M. Douglas McIlroy, but focus has been on the enumeration of all such strings, an approach that does not scale well. A practical alternative is to give up on the constraint that each string should be generated with equal probability. An approach taken by the tool Xeger is to generate the strings by transforming the regular expression into a non-deterministic finite automaton, and then to generate the string simulating the automaton by giving each option in a non-deterministic choice equal probability. This corresponds to a probabilistic automaton whose stochastic matrix consists of uniform distributions.