Suppose we were to take a finite alphabet of letters, say , and we considered all the possible ‘words’ of a given length. For instance, the worlds of length 2 are:
and there are many of these. Let’s denote the set of length- words using the alphabet by .
What I am interested in now is “For a given and , what is the minimum length string of letters I can write down and still be sure that every word in appears somewhere in it as a substring?”. Why might I be interested in such a thing? Because I really should be working on my PhD, that’s why.
contains every length-2 word of 2 letters, in order 01, 11, 10, 00. It’s of optimal length, too, because if we start with just ’01’ and successively add ‘1’, ‘0’, and ‘0’, we find that at every stage the newly added letter creates a new sub-word not previously seen, and so every element of appears exactly once. We cannot delete a letter to make a shorter string without losing one of the words. Similarly,
are examples of optimal tours through and respectively, as they both have this property that each letter added after the first adds a previously unseen word.
Notice I’ve introduced the terminology tour for any string that contains all words in a given , with optimal tour being the tour that is of minimum possible length for that and .
This property provides a lower bound to the length of an optimal tour through . It is impossible to do better than this ‘maximum density’ of words for reasons already discussed: suppose we were to build a tour via repeated selection of single characters, as if I were calling letters from our alphabet out to you and you were writing them down in a single string. If I called out a character that did not add a previously unseen word, then our string is already doing worse than the theoretical best, because some word appears at least twice (redundantly, just increasing our character count). On the other hand, if every letter I call out after the first does add a new word, then I need exactly many characters after the first , and any fewer necessarily misses a word out, so is not a tour.
Clearly, has many elements, so the optimal tour of is at least characters long. It is not guaranteed that such a tour exists for any given and , however, though it is always possible to take a valid tour. We can get an upper bound to the length of the optimal tour by simply noting that concatenating all the words in lexicographic order provides a tour, which is of length . Such a tour definitely possesses redundancies, since it begins with followed by , which may be collapsed down by removing the middle zeros.
Thus , and
is our first result, with equality on the right if and only if , the trivial case.
So how should we think about the problem of finding an optimal tour (and note that an optimal tour need not be unique – indeed, moving the last character to the front in any of the above examples also gives optimal tours)? It seems reasonable to be as ambitious as possible and see what results we can get relating to the absolute minimum-length tour. The process of building the tour letter by letter is a useful tool with which to get our foot in the mathematical door.
Imagine a rack of numbered balls; balls can fit on the rack at any one time. We have a big pile of these balls to take from, which have labels from 0 to , and we are able to slide a ball into one end of the rack, but not remove it once it’s in there. If there are already balls in the rack, then pushing another one in the front end makes the last one fall out of the back end – we have a process of choosing letters that forgets all but the last of them. Our task is to add balls to the rack until every possible combination of labels has appeared at least once. Of course we could just brute-force our way through every combination one by one, but the balls are heavy and hard to lift (oh no! Why did we imagine them to be so heavy?!), so ideally what we would like is some clever method of choosing them to minimise the amount of work we must do.
What we want is a directed graph of many vertices, labelled by words from . An edge is drawn from one word to another if the last letters of match the first letters of . In our ball rack analogy, this means that if the rack is showing the word , then it is possible to make the rack show the word by the addition of only one ball. A tour amounts to some path through the graph that visits every vertex at least once. An optimal tour is therefore a path through the graph that visits every node at least once in the minimal number of edge traversals. A path that visits every vertex exactly once would be the perfect, minimum-length tour, and it would involve edge traversals (as per our earlier discussion). Such a tour is called a hamiltonian path.
But what is the structure of this digraph in general? Is it hamiltonian? If we could show that it is, then it follows that there is always a perfect minimal length tour for every . It’s quite hard to show that a graph is hamiltonian – this is related to the travelling salesman problem. Maybe there is another way. Clearly the graph has some nice symmetry, as tours should be invariant under permutations of the letters. If we have a think about what happens when a letter is added to the rack and do a little extra leg-work now, it is possible to reveal additional structure that lets us treat the problem in a simpler way. Since we are thinking about cycles in graphs, it is much easier to deal with eulerian paths, because these have much stronger theorems surrounding them. But eulerian paths are paths that use each edge exactly once, not each vertex. Can we transform this vertex problem into one about edges? We want to use
A directed graph has an eulerian cycle if and only if every vertex has equal in degree and out degree and all of its vertices with nonzero degree belong to a single strongly connected component.
Firstly, let’s collect the nodes into groups. A word of length can be split into a world of length and a single letter on the end. We write as . Let denote the set of words which have this form (for this particular ). This is the set . We can also split the words the other way, with a single letter at the start: . Clearly, each word belongs to exactly one and exactly one . There are as many of these ‘cosets’ as there are words of length , and each coset has elements.
Following the edges of the graph, each letter defines a map whereby , just like sliding a ball labelled ‘k’ into the rack and pushing out the oldest one. Together, we see that all the letters define a family of maps, whereby any one maps exactly on to all of the . But belongs to some . So from it is possible to reach all of in exactly one move, provided we begin with . But by the symmetry of the argument, if it is possible to enter at all from some vertex, then all of is attainable from that same vertex, so in particular is. Another important fact to convince yourself of is that two distinct elements map onto distinct cosets .
Hence, if we arrive in via , then we know that we must have done so via the word . Therefore we collect all the edges exiting and entering into one single edge and label it . Since we can reach anywhere in , following this edge does not restrict in any way which edges we are allowed to take out of . Therefore let us create a new directed graph, where there are vertices labelled by the cosets , and an edge connects two cosets if there exists mappings in the way described above. Each edge corresponds to the addition of exactly one (undetermined) letter to the rack and is labelled by a word in – this labelling represents the fact that in order to be able to follow this edge, the balls in the rack must be showing the labelled word.
Therefore the question of whether the perfect minimal tour is possible for a given has become the question of whether there exists a path through this new digraph which uses every edge exactly once. This is a question we can soon answer.
There are exactly distinct edges leaving the vertex . How many are entering? Each belongs to a distinct , so each of these provides a distinct edge entering , and any word with this property must necessarily be expressible as one of the . Therefore there are distinct edges entering , so each vertex has in degree and out degree equal to .
Finally, the coset graph is very clearly seen to be strongly connected, as any pair of words in the original graph are strongly connected. Beginning with any word and picking any other target word , taking edges , , up to from results in . Thus by extension, any pair of cosets are strongly connected.
Therefore the conditions of the theorem are satisfied; the coset graph has an eulerian path (in fact, a cycle). Therefore, finally, every has a perfect tour – a string that contains all its words as substrings in the theoretical minimum of characters – the optimum density, with every length- substring giving a different word. Result!
Can I give a concrete procedure for producing optimal tours? In the case of , I certainly can (the case being clearly trivial). Here is the algorithm for :
- Set to be the lowest number not already used as . Add “” to the string. If such an doesn’t exist, go to 4.
- Set to be the highest number greater than and not already used as . If such a doesn’t exist, go to 1.
- Add “” to the string. Go to 2.
- Add “” and then “” for all to the string, in descending order.
- Stop; program complete.
We leave it as an exercise to verify that this reproduces the examples 01100 and 0201221100 for and respectively – in fact, it’s how I generated them. Continuing on, we get for the tour 03020131233221100 – you can check this yourself to see if all 16 of these words appear in it (hint: they do) and that it truly has optimal length (hint: it does).
I am going to add pictures to this soon to make the graph arguments clearer. If it’s appreciably past Christmas of 2014, write a comment complaining. All done – 17/11/14.
If you have any ideas for algorithms for generating optimal tours for , please don’t hesitate to suggest them – or if you have any general comments on the problem at all! I’d love to hear them!
UPDATE – 16/10/14 – By the way, I have recently found out that these tours are more commonly known as de Bruijn sequences, so go and learn about them (my eulerian path technique is a known existence proof)! An algorithm for generating a tour through is to take all Lyndon words in characters and concatenating, in lexicographic order, all those whose length divides . Fantastic!