Fun With Sequences

I like sequences that have non-conventional definitions. For example, there is the very non-equational Look and Say Sequence made famous by Conway:

$\{1, 11, 21, 1211, 111221, 312211, 13112221, 1113213211, 31131211131221, \ldots \}$

This starts with the seed value 1, and each successive term is generated by looking at the previous term, saying the numbers appearing there out loud, and writing down the numbers you find yourself saying. For instance, the term ‘111221’ becomes “Three ones, two twos, one one”, which transliterates to ‘312211’.
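The generating step is easy to sketch in code. Here is a minimal Python version (the function name is my own choice); `itertools.groupby` does the 'looking' for us:

```python
from itertools import groupby

def look_and_say_step(term: str) -> str:
    """Read off runs of identical digits: '111221' -> '312211'."""
    return "".join(str(len(list(run))) + digit for digit, run in groupby(term))

# Reproduce the opening terms from the seed '1':
term = "1"
for _ in range(7):
    term = look_and_say_step(term)
print(term)  # 1113213211
```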

Despite this crazy generating rule, there is actually a lot of structure to be found in this sequence. Conway found that certain strings of digits, once created, never interact with the digits to their left or right again; instead each goes through an internal 'life cycle', growing and changing until it becomes a string of several such atomic strings joined together, and each of those then sets off on its own life cycle like some strange numerical mitosis. Conway named these atomic strings after the chemical elements, since he found exactly 92 of them containing only the digits 1, 2, and 3, together with two further 'transuranic' strings for each larger digit.

Conway also found that the ratio of the lengths of successive terms approaches a constant, $\lambda \approx 1.303577$, and gave a degree-71 polynomial of which this constant is the unique positive real root.
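We can see this growth rate numerically. The quick experiment below (my own sketch, nothing like Conway's actual derivation) iterates the rule and compares the ratio of successive term lengths against the constant:

```python
from itertools import groupby

def look_and_say_step(term: str) -> str:
    return "".join(str(len(list(run))) + digit for digit, run in groupby(term))

term = "1"
lengths = [len(term)]
for _ in range(40):
    term = look_and_say_step(term)
    lengths.append(len(term))

ratio = lengths[-1] / lengths[-2]
print(ratio)  # close to 1.3036 after 40 iterations
```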

The Look and Say Sequence is surprisingly fruitful, given how non-mathematical its rule seems.

So I have a sequence that has an interesting semi-strange-sounding rule. To generate the next term in the sequence, we take the previous term and find the smallest positive integer that doesn’t appear somewhere within it as a substring; the next term is then the previous one with this integer attached to the end.

For instance, if we begin with 0, the first few terms are:

$\{0, 01, 012, 0123, 01234, 012345, 0123456, 01234567, 012345678, 0123456789, 012345678910, 01234567891011, 0123456789101113, \ldots\}$

which, if you look carefully, you will see just stacks the numbers 0 to 11 up next to each other, but then misses out 12 and instead attaches 13 on the end, because 12 already appears as the second and third digits of the previous term. For the same reason, 23, 34, 45, and so on up to 91 will be among the other two-digit numbers missed out, and 123, 234, 345, etc., will also be skipped when we get to them.
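The rule is just as easy to mechanise. A naive Python sketch (names are mine) exploits the fact that once an integer appears as a substring it stays there forever, so the search for the smallest missing integer never has to back up:

```python
def generate(seed: str, base: int, steps: int) -> list[str]:
    """Append, at each step, the smallest positive integer (written in
    `base`) not yet appearing as a substring of the current term."""
    def to_base(n: int) -> str:
        s = ""
        while n:
            n, r = divmod(n, base)
            s = "0123456789"[r] + s  # enough symbols for bases up to 10
        return s

    term, candidate, terms = seed, 1, []
    for _ in range(steps):
        # monotone search: skipped integers stay present, so never re-check them
        while to_base(candidate) in term:
            candidate += 1
        term += to_base(candidate)
        terms.append(term)
    return terms

terms = generate("0", 10, 12)
print(terms[-1])  # 0123456789101113 -- 12 is skipped, 13 is appended
```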

It’s actually easier to see interesting behaviour if we pick a smaller base (since of course in base $b$ the first $b$ many digits have never appeared before, so you have to go at least that many terms in before you see anything nontrivial). So let’s look at the sequence in base 2, starting with the seed term 0.

$\{0, 01, 0110, 0110100, 0110100111, 01101001111000, 011010011110001011, 01101001111000101110000, \ldots\}$

Of course, the burning question that immediately springs to mind is how often integers are added: how common are the chosen integers, compared to all integers? First of all, we immediately see that every power of the base (i.e. every $100\ldots 00|_b$) is always added, as there is no way for a one followed by exactly that many zeros to appear any earlier: no integer has a leading zero, so a run of $n$ zeros cannot straddle the join between two appended integers, and any integer containing $n$ zeros internally must have at least $n+1$ digits, making it at least $b^n$ itself. This would seem to suggest that the smaller the base, the more frequently integers are added. But wait! We don't know anything about what happens between powers of the base yet, and for larger bases there are more numbers between each power, and correspondingly more chances for numbers to be added.
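The claim that every power of the base gets added can be checked numerically at small scale. A quick sketch (function name mine, scanning candidates in increasing order, which is equivalent to the rule since skipped integers stay present):

```python
def added_up_to(seed: str, base: int, limit: int) -> set[int]:
    """Set of integers below `limit` that the rule appends."""
    def to_base(n: int) -> str:
        s = ""
        while n:
            n, r = divmod(n, base)
            s = "0123456789"[r] + s
        return s

    term, added = seed, set()
    for candidate in range(1, limit):
        digits = to_base(candidate)
        if digits not in term:
            term += digits
            added.add(candidate)
    return added

added = added_up_to("0", 2, 2 ** 11)
print(all(2 ** k in added for k in range(11)))  # expect True: every power of 2 appended
```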

We can study this in a less ungainly way if we instead take the sequence of added integers. Given a seed and a base, the integers that are added uniquely determine the original sequence, and vice versa, so we can study exactly the same information without having the sequence terms grow too long too quickly. Compare, for instance, the first hundred integers added to the sequence with seed 0 in base ten and base two:

The first 100 integers added in base 10 and base 2 to the seed term 0.

Guess which is which. Perhaps surprisingly, blue corresponds to base 10 and red to base 2. Actually this is not so surprising, because there are five times as many symbols in use in base 10 as in base 2, so as a rough estimate a given number is in principle five times less likely to have occurred before. Indeed, taking this to extremes, in 'base $\aleph_0$', where every integer gets its own symbol, the sequence is rather tedious: it just ends up being an ordered list of all the integers. While the integers added in base 10 begin

$\{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, \ldots\}$,

the base 2 ones begin

$\{1, 2, 4, 7, 8, 11, 16, 18, 21, 22, 25, 29, 31, 32, \ldots \}$.
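Both lists are quick to reproduce. A small sketch (again with names of my own choosing) that records which integers get appended:

```python
def added_integers(seed: str, base: int, count: int) -> list[int]:
    """Return the first `count` integers appended by the rule, in order."""
    def to_base(n: int) -> str:
        s = ""
        while n:
            n, r = divmod(n, base)
            s = "0123456789"[r] + s
        return s

    term, candidate, added = seed, 1, []
    while len(added) < count:
        if to_base(candidate) not in term:
            term += to_base(candidate)
            added.append(candidate)
        candidate += 1
    return added

print(added_integers("0", 10, 14))  # 1 through 11, then 13, 14, 15 (12 skipped)
print(added_integers("0", 2, 14))   # [1, 2, 4, 7, 8, 11, 16, 18, 21, 22, 25, 29, 31, 32]
```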

Are these graphs roughly linear, or are they just very young, linear-looking power laws? Or exponentials? It’s hard to tell. One thing to notice is that the graphs are very nearly piecewise exponential, where by this I mean that there are long smooth sections which ‘scoop’ up steeper and steeper until they hit the next power of the base, where there is a corner and a new flatter smooth section begins. This is very obvious at $y = 100$ for the blue curve, and even more obvious when we compare larger numbers of terms:

The first 10000 integers added in base ten and base two.

Note how similar these graphs are to the first one, which corresponds to a region near the origin magnified 100 times! Some of the jitteriness of the base 2 graph (red) has been averaged out by large numbers, and now the smooth segments are more obvious.

One way to estimate how frequently numbers are added is to look at the density of these numbers within each power of the base, that is, the proportion of digit strings of a given length which are added. This density is always at least $b^{-n}(b-1)^{-1}$ for strings of length $(n+1)$, since we know that at least one string of length $(n+1)$ is always added (the string $100\ldots 00|_b$ of $n$ zeros). The question is whether this proportion goes to zero, whether it converges to some nonzero proportion, or whether it oscillates within some range. If the nearly linear behaviour of these graphs is a general trend and not just transient, then it would appear that the proportion of integers added approaches a constant. If the $n$th added integer is approximately $\lambda \times n$ for some slope $\lambda > 1$ (why this restriction?) then the proportion of integers added is $\frac{1}{\lambda}$ (true for any base). If the graph is not linear and instead leans back further and further, then the proportion must approach zero.

So I ran the sequence out to a large number of terms (20000; I tried up to 50000 but ran out of memory). I counted the number of added integers between each power of the base – of course I'm working in base 2 to get the maximum number of powers – and divided each count by the total number of integers in that range to get the proportion.

Proportion of added integers between each power of the base.

Making it all the way up to $2^{17}$ was a herculean effort on the part of my processor, especially since I was not so picky about memory and was actually storing a whole range of relevant info in my script. Anyway, this graph looks to be decaying slowly and smoothly to zero after a jumpy start. It looks like $\frac{1}{x}$, and in fact agrees with it very well:
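At small scales the same computation is cheap, and the proportions can be compared against $\frac{1}{x}$ by eye (assuming, as in the plot, that $x$ indexes the power of the base). A sketch of my own, in base 2:

```python
def proportions(base: int, max_power: int) -> list[float]:
    """Proportion of the integers in [base**(k-1), base**k) that the rule
    appends, for k = 1 .. max_power."""
    def to_base(n: int) -> str:
        s = ""
        while n:
            n, r = divmod(n, base)
            s = "0123456789"[r] + s
        return s

    term, counts = "0", [0] * (max_power + 1)
    for candidate in range(1, base ** max_power):
        digits = to_base(candidate)
        if digits not in term:
            term += digits
            counts[len(digits)] += 1

    # there are (base - 1) * base**(k - 1) integers with exactly k digits
    return [counts[k] / ((base - 1) * base ** (k - 1))
            for k in range(1, max_power + 1)]

for k, p in enumerate(proportions(2, 10), start=1):
    print(k, round(p, 4), round(1 / k, 4))  # proportion vs 1/k, jumpy at first
```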

Added integer proportions compared to $y = \frac{1}{x}$ (red).

Whether or not this happens in other bases I don’t know, since I would have to compute so many more terms to get as many data points. Who knows whether or not this is a coincidence and only appears at small powers of the base? It seems nice, but it’s only 17 terms. Caution.

That wraps this little bit up. I might look at this sequence again soon: I want to go into its Shannon entropy and see how often various strings are represented, which digit is more common, and so on.