Complete Word Tours

Suppose we were to take a finite alphabet \mathbb{A}_n of n letters, say \{ 0, 1, \ldots, n-1\}, and we considered all the possible ‘words’ of a given length. For instance, the worlds of length 2 are:

\{00, 01, \ldots, 0(n-1), 10, 11, \ldots, (n-1)(n-1)\}

and there are n^2 many of these. Let’s denote the set of length-m words using the alphabet \mathbb{A}_n by \mathbb{W}^m_n.

What I am interested in now is “For a given n and m, what is the minimum length string of letters I can write down and still be sure that every word in \mathbb{W}^m_n appears somewhere in it as a substring?”. Why might I be interested in such a thing? Because I really should be working on my PhD, that’s why.

Example:

01100

contains every length-2 word of 2 letters, in order 01, 11, 10, 00. It’s of optimal length, too, because if we start with just ’01’ and successively add ‘1’, ‘0’, and ‘0’, we find that at every stage the newly added letter creates a new sub-word not previously seen, and so every element of \mathbb{W}^2_2 appears exactly once. We cannot delete a letter to make a shorter string without losing one of the words. Similarly,

0011101000

and

0201221100

are examples of optimal tours through \mathbb{W}^3_2 and \mathbb{W}^2_3 respectively, as they both have this property that each letter added after the first m adds a previously unseen word.

Notice I’ve introduced the terminology tour for any string \tau that contains all words in a given \mathbb{W}^m_n, with optimal tour being the tour \tau^m_n that is of minimum possible length for that m and n.

This property provides a lower bound to the length of an optimal tour through \mathbb{W}^m_n. It is impossible to do better than this ‘maximum density’ of words for reasons already discussed: suppose we were to build a tour via repeated selection of single characters, as if I were calling letters from our alphabet out to you and you were writing them down in a single string. If I called out a character that did not add a previously unseen word, then our string is already doing worse than the theoretical best, because some word appears at least twice (redundantly, just increasing our character count). On the other hand, if every letter I call out after the first m does add a new word, then I need exactly \| \mathbb{W}^m_n\| - 1 many characters after the first m, and any fewer necessarily misses a word out, so is not a tour.

Clearly, \mathbb{W}^m_n has n^m many elements, so the optimal tour of \mathbb{W}^m_n is at least n^m + m - 1 characters long. It is not guaranteed that such a tour exists for any given m and n, however, though it is always possible to take a valid tour. We can get an upper bound to the length of the optimal tour by simply noting that concatenating all the words in lexicographic order provides a tour, which is of length m n^m. Such a tour definitely possesses redundancies, since it begins with 000\cdots 00 followed by 000\cdots 01, which may be collapsed down by removing the middle 2(m-1) zeros.

Thus \exists \; \tau^m_n \: \forall m, n \in \mathbb{Z}^+, and

n^m + m - 1 \leq |\tau^m_n|\leq m n^m

is our first result, with equality on the right if and only if m = 1, the trivial case.


So how should we think about the problem of finding an optimal tour (and note that an optimal tour need not be unique – indeed, moving the last character to the front in any of the above examples also gives optimal tours)? It seems reasonable to be as ambitious as possible and see what results we can get relating to the absolute minimum-length tour. The process of building the tour letter by letter is a useful tool with which to get our foot in the mathematical door.

Imagine a rack of numbered balls; m balls can fit on the rack at any one time. We have a big pile of these balls to take from, which have labels from 0 to (n-1), and we are able to slide a ball into one end of the rack, but not remove it once it’s in there. If there are already m balls in the rack, then pushing another one in the front end makes the last one fall out of the back end – we have a process of choosing letters that forgets all but the last m of them. Our task is to add balls to the rack until every possible combination of labels has appeared at least once. Of course we could just brute-force our way through every combination one by one, but the balls are heavy and hard to lift (oh no! Why did we imagine them to be so heavy?!), so ideally what we would like is some clever method of choosing them to minimise the amount of work we must do.

What we want is a directed graph of n^m many vertices, labelled by words from \mathbb{W}^m_n. An edge is drawn from one word v to another w if the last (m-1) letters of v match the first (m-1) letters of w. In our ball rack analogy, this means that if the rack is showing the word v, then it is possible to make the rack show the word w by the addition of only one ball. A tour amounts to some path through the graph that visits every vertex at least once. An optimal tour is therefore a path through the graph that visits every node at least once in the minimal number of edge traversals. A path that visits every vertex exactly once would be the perfect, minimum-length tour, and it would involve n^m - 1 edge traversals (as per our earlier discussion). Such a tour is called a hamiltonian path.

Words as a digraph and the hamiltonian path through them.

W^2_2 and its hamiltonian path

But what is the structure of this digraph in general? Is it hamiltonian? If we could show that it is, then it follows that there is always a perfect minimal length tour for every \mathbb{W}^m_n. It’s quite hard to show that a graph is hamiltonian – this is related to the travelling salesman problem. Maybe there is another way. Clearly the graph has some nice symmetry, as tours should be invariant under permutations of the letters. If we have a think about what happens when a letter is added to the rack and do a little extra leg-work now, it is possible to reveal additional structure that lets us treat the problem in a simpler way. Since we are thinking about cycles in graphs, it is much easier to deal with eulerian paths, because these have much stronger theorems surrounding them. But eulerian paths are paths that use each edge exactly once, not each vertex. Can we transform this vertex problem into one about edges? We want to use

Theorem

A directed graph has an eulerian cycle if and only if every vertex has equal in degree and out degree and all of its vertices with nonzero degree belong to a single strongly connected component.

Firstly, let’s collect the nodes into groups. A word v of length m can be split into a world w of length (m-1) and a single letter a on the end. We write v as [w, a]. Let [w, \cdot] denote the set of words which have this form (for this particular w). This is the set \{[w, 0], [w, 1], \ldots, [w, (n-1)]\}. We can also split the words the other way, with a single letter at the start: [\cdot, u] = \{[0, u], [1, u], \ldots, [(n-1), u]\}. Clearly, each word belongs to exactly one [w, \cdot] and exactly one [\cdot, u]. There are as many of these ‘cosets’ as there are words of length (m-1), and each coset has n elements.

Following the edges of the graph, each letter k defines a map k: [\cdot, u] \to [u, \cdot] \: \forall u \in \mathbb{W}^m_n whereby k([\cdot, u]) = \{[u, k]\}, just like sliding a ball labelled ‘k’ into the rack and pushing out the oldest one. Together, we see that all the letters define a family of maps, whereby any one [a, u] maps exactly on to all of the [u, \cdot]. But [a, u] = [w, b] belongs to some [w, \cdot]. So from [w, \cdot] it is possible to reach all of [u, \cdot] in exactly one move, provided we begin with [a, u] = [w, b]. But by the symmetry of the argument, if it is possible to enter [w, \cdot] at all from some vertex, then all of [w, \cdot] is attainable from that same vertex, so in particular [w, b] is. Another important fact to convince yourself of is that two distinct elements [w, c], [w,d] \in [w, \cdot] map onto distinct cosets [u_c, \cdot], [u_d, \cdot].

Paths through 01212

Example: All the paths through 01212 in \mathbb{W}^5_4, with elements collected into ‘cosets’.

Hence, if we arrive in [u, \cdot] via [w, \cdot], then we know that we must have done so via the word [a, u] = [w, b] \in [w, \cdot]. Therefore we collect all the edges exiting [w, \cdot] and entering [u, \cdot] into one single edge and label it [w, b]. Since we can reach anywhere in [u, \cdot], following this edge does not restrict in any way which edges we are allowed to take out of [u, \cdot]. Therefore let us create a new directed graph, where there are n^{(m-1)} vertices labelled by the cosets [w, \cdot], and an edge connects two cosets if there exists mappings in the way described above. Each edge corresponds to the addition of exactly one (undetermined) letter to the rack and is labelled by a word in \mathbb{W}^m_n – this labelling represents the fact that in order to be able to follow this edge, the balls in the rack must be showing the labelled word.

Therefore the question of whether the perfect minimal tour is possible for a given \mathbb{W}^m_n has become the question of whether there exists a path through this new digraph which uses every edge exactly once. This is a question we can soon answer.

There are exactly n distinct edges leaving the vertex [w, \cdot]. How many are entering? Each [a, w] \in [\cdot, w] belongs to a distinct [v_a, \cdot], so each of these provides a distinct edge entering [w, \cdot], and any word with this property must necessarily be expressible as one of the [\cdot, w]. Therefore there are n distinct edges entering [w, \cdot], so each vertex has in degree and out degree equal to n.

Can you see the eulerian paths through each?

Example: Reduced graphs for \mathbb{W}^2_4 and \mathbb{W}^3_2. Each edge corresponds to a particular word appearing in the tour. The first graph is complete, the second is equivalent to that of \mathbb{W}^2_2 above.

Finally, the coset graph is very clearly seen to be strongly connected, as any pair of words in the original graph are strongly connected. Beginning with any word v and picking any other target word w = w_1 w_2 \cdots w_m, taking edges w_1, w_2, up to w_m from v results in w. Thus by extension, any pair of cosets are strongly connected.

Therefore the conditions of the theorem are satisfied; the coset graph has an eulerian path (in fact, a cycle). Therefore, finally, every \mathbb{W}^m_n has a perfect tour – a string that contains all its n^m words as substrings in the theoretical minimum of n^m + m - 1 characters – the optimum density, with every length-m substring giving a different word. Result!


Can I give a concrete procedure for producing optimal tours? In the case of m=2, I certainly can (the case m = 1 being clearly trivial). Here is the algorithm for \mathbb{W}^2_n:

  1. Set x to be the lowest number not already used as x. Add “x” to the string. If such an x doesn’t exist, go to 4.
  2. Set y_x to be the highest number greater than (x+1) and not already used as y_x. If such a y_x doesn’t exist, go to 1.
  3. Add “y_x x” to the string. Go to 2.
  4. Add “(n-1)” and then “i i ” for all i = (n-2), \ldots, 1, 0 to the string, in descending order.
  5. Stop; program complete.

We leave it as an exercise to verify that this reproduces the examples 01100 and 0201221100 for \mathbb{W}^2_2 and \mathbb{W}^2_3 respectively – in fact, it’s how I generated them. Continuing on, we get for \mathbb{W}^2_4 the tour 03020131233221100 – you can check this yourself to see if all 16 of these words appear in it (hint: they do) and that it truly has optimal length (hint: it does).


I am going to add pictures to this soon to make the graph arguments clearer. If it’s appreciably past Christmas of 2014, write a comment complaining. All done – 17/11/14.

If you have any ideas for algorithms for generating optimal tours for m > 2, please don’t hesitate to suggest them – or if you have any general comments on the problem at all! I’d love to hear them!

UPDATE16/10/14 – By the way, I have recently found out that these tours are more commonly known as de Bruijn sequences, so go and learn about them (my eulerian path technique is a known existence proof)! An algorithm for generating a tour through W^m_n is to take all Lyndon words in n characters and concatenating, in lexicographic order, all those whose length divides m. Fantastic!

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s