30

So, a while back I read a joke that went something like this:

"Never compute pi in binary - because it goes on infinitely and is random, it theoretically contains every finite bit string. So, you will then possess all copyrighted material in existence and be liable for some serious fines."

This is obviously meant to be humorous, but it got me thinking. If every finite bit string exists in a binary representation of pi, would it be possible to use this as a method of transmitting data?

For example, let's say I wanted to transmit a bit string that could be interpreted as a JPEG image. Instead of sending the information directly, I would find its location within the digits of pi, and simply send the location of its first bit within the digits of pi, as well as the length of the string.

This seems pretty straightforward to me, but the obvious hurdle is that the probability of finding a given string within even the first several trillion digits is remarkably small. So, it could end up taking an immense amount of time to find.

My thinking is that several machines could be dedicated to searching for large files within pi, and then creating an index of all of their start locations. So, each computation would only need to occur once and then that information could be transmitted extremely quickly from then on.
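To make the idea concrete, here's a rough toy sketch of the scheme I'm imagining (in Python, using just the first 100 decimal digits as a stand-in for a big precomputed expansion - obviously a real version would use bits and vastly more digits):

```python
# Toy sketch of the scheme: the first 100 decimal digits of pi
# stand in for a huge precomputed expansion.
PI_DIGITS = ("1415926535897932384626433832795028841971"
             "6939937510582097494459230781640628620899"
             "86280348253421170679")

def encode(data):
    """Return (offset, length) of `data` inside the digits, or None if absent."""
    pos = PI_DIGITS.find(data)
    return (pos, len(data)) if pos >= 0 else None

def decode(offset, length):
    """Recover the original string from its location."""
    return PI_DIGITS[offset:offset + length]

loc = encode("26433")   # "transmit" this (offset, length) pair instead of the data
```

Here `decode(*loc)` gives back the original string; the question is whether sending `loc` can ever beat sending the data itself.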

So, what do you think? Is this at all feasible, or would these computations take far too much time?

Thanks for reading! I apologize if I have overlooked any posting guidelines; this is my first question in this forum.

EDIT:

Thanks for your quick responses, folks! I figured there was an error in my reasoning; nice to know why!

Brian Tompsett - 汤莱恩
Chris Staikos
  • Only 10 trillion digits of Pi are known. That's `10^13`. It's unlikely that you'll find any random string longer than 13 digits in that. – Mysticial Jul 06 '12 at 18:40
  • You seem to be missing an important concept: "information entropy". Even if you had infinite digits of Pi, the address containing XXX data will be at least as long as XXX itself. – Mysticial Jul 06 '12 at 18:45
  • @Mysticial - Not sure I agree with your comments. Your first is speaking about base 10 digits, not binary. Your second comment can be disproven by saying that I may have found where my bit string exists within pi and it (somehow or other) begins at bit 5 within Pi's bit string. My string that I need to compare was 1000000 bits long. The address I send is 5, not 1000000. – trumpetlicks Jul 06 '12 at 18:50
  • @trumpetlicks It doesn't matter whether it's binary or decimal. They can encode each other. It's about information entropy. – Mysticial Jul 06 '12 at 18:53
  • @trumpetlicks I'm saying that the compression is not possible, because the address will be as large as the data itself. – Mysticial Jul 06 '12 at 19:04
  • @trumpetlicks: "encode 2^n start positions using just n bits": Well, if you just write the string in plaintext, you get the same result. And you can't expect to encode all strings of length `n` in a string less than `2^n` in length. – jpalecek Jul 06 '12 at 19:07
  • @trumpetlicks It's not one-to-one. It's just probability and law of large numbers. If your data just happens to be Pi itself, then the start address is zero. So that's a case where it is possible to compress. But in general, no it will not compress significantly. – Mysticial Jul 06 '12 at 19:15
  • @trumpetlicks Don't forget that the starting location of N-bits of data will increase exponentially - thereby canceling out that `log(n)`. – Mysticial Jul 06 '12 at 19:28
  • @trumpetlicks Read the SSN example in my answer. I use a 9-digit SSN and the starting address is also about 9 digits. If you use an N-digit SSN, the starting address will also be about N-digits. – Mysticial Jul 06 '12 at 19:48
  • @trumpetlicks Take my last comment, replace every instance of the word "digit" with "bit" and it will still hold. It will hold in all bases in fact. – Mysticial Jul 06 '12 at 21:30
  • @cstaikos: Even though Mysticial and others explained why this is infeasible, I like that you even had the idea. +1 from me. – John Y Jul 06 '12 at 21:59
  • Within the right parameters, the general idea can be made to work. Pi probably isn't the ideal sequence, but fractal image compression is based on storing/transmitting a function and parameters to create something similar to the original data (though it's usually just "similar", so it's lossy compression). – Jerry Coffin Jul 06 '12 at 22:38
  • @JerryCoffin: But even if you found "the" perfect sequence of 1s and 0s, you're still going to transmit at _least_ as much information as the data itself on average. – Mooing Duck Jul 07 '12 at 00:21
  • @MooingDuck: Yes and no. Across all possible input strings, yes. But the inputs people care about *mostly* contain some degree of structure and redundancy, not pure entropy -- and in that case, compression is not only possible, but often works pretty well. – Jerry Coffin Jul 07 '12 at 01:16
  • You guys seem to all be missing the stronger point: just because Pi happens to be irrational (and transcendental) does _not_ mean every n-bit string can be found at all in its binary expansion (or any other base for that matter). For all you know there is one particular 70-bit string that simply never occurs. The nth bit of Pi is _not_ a random variable. – Thomas Jul 07 '12 at 09:13
  • See also [Magic Function Theory](http://www.dogma.net/markn/FAQ.html#Q19) discussion. Actually, I suggest reading all of [comp.compression faq](http://www.faqs.org/faqs/compression-faq/) if you find this sort of thing interesting. It covers a lot of popular, wrong theories. Though just understanding the pigeonhole principle is enough to debunk most of the more popular ideas. – Brian Jul 09 '12 at 20:08

5 Answers

48

Expanding on my comments: there's a very important concept here called information entropy.

In the interest of full disclosure, I'm the current world-record holder for the most computed digits of Pi, at 10 trillion digits (10^13).

I have approximately 10,000 copies of everyone's social security number.

However, that doesn't mean I can just hack into everyone's accounts and steal their identities, because I don't know where each person's SSN starts. And for a typical 9-digit SSN, the first offset in Pi where that SSN appears will itself be on the order of 9 digits long. In other words, the information about the SSN is kept in the address rather than in Pi itself.


For example, if someone has the SSN: 938-93-3556

It starts at offset 597,507,393 in Pi. That number 597,507,393 is about as long as the SSN itself. In other words, we've gained nothing by using Pi.
(I'm not sure if there's an earlier offset where it appears, but the probability decreases exponentially with smaller offsets.)


To generalize this, even if you had infinite digits of Pi (which theoretically holds all possible information), the address that holds data XXX will (with extreme probability) be as large as XXX itself.

In other words, the information is not held in the digits of Pi itself, but rather in the address where the information starts.
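A quick way to sanity-check this is to measure first offsets in a seeded pseudo-random digit stream standing in for Pi (an assumption: Pi's digits are believed, though not proven, to behave this way statistically). A minimal sketch:

```python
import random

# Stand-in for Pi: a seeded pseudo-random decimal digit stream.
random.seed(0)
stream = "".join(random.choices("0123456789", k=2_000_000))

# Average first offset of 200 random 3-digit "messages".
msgs = ["%03d" % random.randrange(1000) for _ in range(200)]
offsets = [stream.find(m) for m in msgs]
avg = sum(offsets) / len(offsets)
# The average lands near 10^3: writing the offset down takes about
# as many digits as the 3-digit message itself.
```

The same scaling shows up at every length: an n-digit message has a first offset around 10^n, so the address is about as long as the data.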

Mysticial
  • This is a sweet answer. The comparison with SSNs is particularly compelling – sehe Jul 06 '12 at 19:12
  • But you can't guarantee that you have any subsequence of digits in there. That is, without actually searching for a particular SSN in the sequence, you can't guarantee that it's in there. – Rafael Baptista Jul 06 '12 at 19:19
  • @RafaelBaptista: true, but that's not the point Mysticial is making. What he's saying is that transferring the offset/index will (on average) take _more_ data than transferring the data itself. – Mooing Duck Jul 06 '12 at 19:21
  • I agree about information entropy, for the same reason that you can't compress an arbitrary message below a certain size. I'm just saying that even if you wanted to use an offset instead of the original message, there is no guarantee that such an offset exists for an arbitrary message. – Rafael Baptista Jul 06 '12 at 19:23
  • @cstaikos [melak47](http://stackoverflow.com/users/996886/melak47) calculated the offsets for all the four digits numbers, and found that for 6300/10000, the index took more space than the data itself. That's more than half, leading to a net loss. – Mooing Duck Jul 06 '12 at 19:52
  • @trumpetlicks: In base 10, it takes more space on average. In base 2, it also takes more space, _by exactly the same proportion_. And the same with every single other base, by exactly the same proportion. In all bases, it takes more space on average to encode the offset than it does the data itself. – Mooing Duck Jul 06 '12 at 21:26
  • @Rafael: Actually, if we assume [certain randomness properties](http://en.wikipedia.org/wiki/Normal_number) hold for pi *(it is believed they do)*, we can guarantee that, with probability 1, **every** finite sequence of digits, no matter how long, will eventually appear in the digits of pi. – BlueRaja - Danny Pflughoeft Jul 07 '12 at 01:07
  • @BlueRaja: Right. It is not known if PI is a normal number, so it cannot be proven that it will eventually contain every possible sequence. – Rafael Baptista Jul 07 '12 at 18:29
  • @trumpetlicks: Why are you talking about decimal digits? For the numbers in Mysticial's example, you can hold the index in 30 bits, or the value itself in 30 bits. The _value_, not the decimal representation. – Mooing Duck Jul 08 '12 at 16:58
  • @trumpetlicks: no real bignum library uses decimal characters, that's absurd. And everyone but you is talking about the binary representation. Mysticial and I were simply using base 10 samples because they're easier, but the same concepts apply to base 2. – Mooing Duck Jul 08 '12 at 21:41
  • @trumpetlicks: As I said, nobody stores the digits of PI that way, because no bignum works that way. Every bignum I've worked with uses base 4294967296 on the inside, for speed. – Mooing Duck Jul 08 '12 at 21:44
  • Maybe I should pull out a hexadecimal SSN example. But other than that, that's enough for me. This conversation is getting unnecessarily long - especially if there's only one person who's having trouble grasping the concept of information theory. – Mysticial Jul 08 '12 at 21:46
  • @Mysticial - Apologies, I don't mean to drag this on. I understand number theory, and have built most of my own bignum library to boot. In designing comm. protocols I have MANY times used lookup tables where I send an index into that table. IT HAS ALWAYS BEEN FASTER. If we look at PI as simply THAT lookup table, the problem is exactly what I have answered. The struggle here is not with information entropy, it is with PI and its availability of the values one wishes within its indexable set. I do THANK YOU for your comments, I am here to learn as much as anything :-) +1 – trumpetlicks Jul 08 '12 at 22:03
23

Because we were all kind of bored in the Lounge<C++>, I went ahead and implemented a search to find the average offsets of 'messages' of specific lengths.

I downloaded 1 million digits of Pi and looked for all substrings of a fixed length (e.g. 00..99 for length 2). Depending on the message length, you get the following outputs:

 Digits    Avg.Offset    Unfound

 1            8.1        0
 2          107.07       0
 3          989.874      0
 4         9940.46       0
 5        99959.4        8 <-- note

Note that we already start hitting unfound patterns when the number of possible messages (10^5) is just 10% of the number of digits searched (10^6).

Note also that, as predicted by the laws of information entropy, the average offset is roughly 10^L for a message of L digits - that is, the offset takes about as many digits to write down as the message itself.


Raw output and timings:

Running

for a in 10 100 1000 10000 100000; do \make -B CFLAGS=-DNUMRANGE=$a && time ./test; done

Shows

g++ -DNUMRANGE=10 -std=c++0x -g -O3 -fopenmp -march=native test.cpp -o test && ./test
0 unfound
81 cumulative, 8.1 average

real    0m0.008s
user    0m0.008s
sys 0m0.004s
g++ -DNUMRANGE=100 -std=c++0x -g -O3 -fopenmp -march=native test.cpp -o test && ./test
0 unfound
10707 cumulative, 107.07 average

real    0m0.004s

g++ -DNUMRANGE=1000 -std=c++0x -g -O3 -fopenmp -march=native test.cpp -o test && ./test
0 unfound
989874 cumulative, 989.874 average

real    0m0.010s

g++ -DNUMRANGE=10000 -std=c++0x -g -O3 -fopenmp -march=native test.cpp -o test && ./test
0 unfound
9.94046e+07 cumulative, 9940.46 average

real    0m0.081s

g++ -DNUMRANGE=100000 -std=c++0x -g -O3 -fopenmp -march=native test.cpp -o test && ./test
8 unfound
9.99594e+09 cumulative, 99959.4 average

real    0m7.387s

Full code, makefile and pi digits: https://gist.github.com/3062541
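For comparison, a minimal Python rerun of the length-5 case (with a seeded pseudo-random digit stream standing in for the downloaded Pi digits; the gist above has the original C++) could look like:

```python
import random

# Pseudo-random stand-in for 1 million digits of Pi.
random.seed(1)
N = 1_000_000
stream = "".join(random.choices("0123456789", k=N))

L = 5          # message length in digits
first = {}     # first offset of each length-L substring that occurs
for i in range(N - L + 1):
    s = stream[i:i + L]
    if s not in first:
        first[s] = i

unfound = 10**L - len(first)
avg = sum(first.values()) / len(first)
print(unfound, "unfound,", avg, "average offset")
```

The numbers come out in the same ballpark as the table above: a handful of unfound patterns and an average offset near 10^5.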

sehe
  • @trumpetlicks: The base is arbitrary. All the reasons it won't work in base 10 are _exactly_ the same reasons it won't work in base 2, except you multiply all the averages by 3.32192809. – Mooing Duck Jul 06 '12 at 21:24
  • @trumpetlicks clarifying: `3.32192809... ~= ln(10)/ln(2)` – sehe Jul 06 '12 at 21:27
  • +1 for being bored enough to try this. I was taking a nap since I woke up too early this morning. :P – Mysticial Jul 06 '12 at 21:34
  • @Mysticial Pssh what are you talking about *being bored enough*? +1 for *awesomeness*. – Drise Jul 06 '12 at 21:37
6

No, it's not possible to efficiently find an arbitrary sequence in a random sequence -- that follows from the definition of "random." (If there were a way to predict where the sequence occurred, it wouldn't be random.)

As for indexing all the locations, well, what have you gained? You're essentially saying "Jump to starting point 0..." and then you have to say either "...and then calculate the next JPEG-sized run of bits in π..." (no win, since you have to use up energy doing the calculation) or "...and then look up the next JPEG-sized chunk of data in the mega-π index." (In which case you could just, y'know, load the JPEG file.)

You can't win and you can't break even (and, for what it's worth, you can't get out of the game).

UPDATE: @Mysticial's answer is better than mine. His point

For example, if someone has the SSN: 938-93-3556

It starts at offset 597,507,393 in Pi. That number 597,507,393 is about as long as the SSN itself. In other words, we've gained nothing by using Pi.

elegantly captures the fundamental problem.

Larry OBrien
  • Funny. This seems to center on the feasibility of finding the offset. The thing is, if you ignored that, you'd run into the real problem: you'd lose information efficiency by representing the offset (instead of the actual data). – sehe Jul 06 '12 at 19:20
  • @Mysticial Yes, your answer is better than mine which was just a flip response. – Larry OBrien Jul 06 '12 at 19:26
  • *"it's not possible to efficiently find an arbitrary sequence in a random sequence"* - while this is true, pi is not a random sequence. It's possible, for instance, to efficiently calculate the nth digit of pi without having to calculate the digits preceding it. It's certainly feasible that some number-theorist could come up with an efficient way to locate the first position of an arbitrary bitstring in pi... but *(besides being outside the scope of this site)* that doesn't give us a way to compress that bitstring, for the entropy reasons mentioned by @Mysticial. – BlueRaja - Danny Pflughoeft Jul 07 '12 at 01:01
  • Actually it is possible to do that; that's how DNA sequence indexing is done. – Erik Aronesty Sep 18 '15 at 15:03
5

That statement is wrong. Pi is infinite, and its next digit is unpredictable, but that doesn't mean that every possible string is in there.

For example, suppose I construct a sequence similar to pi's, except that any time a run of 20 binary zeroes would appear, I compute the next 20 bits and substitute them for the zeroes.

That sequence is also infinite, and unpredictable - but we can know with certainty that it never contains a sequence of 20 binary zeroes.

There is no way to prove that PI contains every possible sequence of bits.
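As a toy instance of the same point: the binary expansion 0.10110111011110..., built by concatenating n ones followed by a zero for n = 1, 2, 3, ..., never becomes periodic (the runs of ones grow forever, so the number is irrational), yet it provably never contains the two-bit string "00". A quick finite check:

```python
# Concatenate "1"*n + "0" for n = 1..n_blocks: each block ends in a
# single zero and the next starts with a one, so "00" never occurs,
# while the growing runs of ones rule out any periodic pattern.
def bits(n_blocks):
    return "".join("1" * n + "0" for n in range(1, n_blocks + 1))

s = bits(1000)
```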

This might also help answer it: http://www.youtube.com/watch?v=8PUJvAlD64k

Rafael Baptista
  • Sorry, but if we can compute 10^13 digits and, in principle, can compute any digit of Pi (given enough time/memory), these digits cannot be described as "unpredictable". However, if a "random" sequence doesn't contain a certain fixed string, it can't be random (wrt. Turing machines; there are several definitions, but they are largely equivalent). – jpalecek Jul 06 '12 at 19:04
  • They are unpredictable in the sense that you can't predict what the nth digit of Pi is without computing Pi out to that digit. – Rafael Baptista Jul 06 '12 at 19:15
  • Actually there's a spigot algorithm for computing binary digits of pi without computing the preceding digits (the BBP formula is somewhat well known). In addition, I'm not sure how you're concluding that we can't prove pi contains every possible sequence of bits (whether it does is currently unknown). – Nabb Jul 07 '12 at 04:25
  • I'm not saying that I can prove it doesn't. I'm just saying that infinite non-repeating pseudo random number sequences like pi cannot be proven to have every sequence. – Rafael Baptista Jul 07 '12 at 18:27
  • I agree with @Nabb - saying something cannot be proven is a pretty strong claim. – Erik Aronesty Sep 18 '15 at 15:04
  • How can I word this point to get past you math lawyers? How about: "The property of being an infinite non-repeating pseudo random number sequence is not sufficient to guarantee that every possible finite subsequence exists in that sequence." I demonstrate that by construction in my post. – Rafael Baptista Sep 18 '15 at 17:09
1

because it goes on infinitely and is random, it theoretically contains every finite bit string

Pi goes on infinitely, but it definitely isn't random - i.e. its digits can be computed by a program of size O(log n) (and therefore finite prefixes can be generated by programs much smaller than the prefixes themselves), which means the Kolmogorov complexity of prefixes of Pi is asymptotically less than their size. So the "random" premise of the joke fails, and it has yet to be proven that Pi contains every finite string (I don't know of a proof).
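To make the small-program point concrete, the well-known Bailey-Borwein-Plouffe series gives a constant-size description that pins down every digit of Pi. A sketch summing it with exact rationals:

```python
from fractions import Fraction

# Bailey-Borwein-Plouffe series:
# pi = sum_k 1/16^k * (4/(8k+1) - 2/(8k+4) - 1/(8k+5) - 1/(8k+6))
def bbp(terms):
    s = Fraction(0)
    for k in range(terms):
        s += Fraction(1, 16**k) * (
            Fraction(4, 8*k + 1) - Fraction(2, 8*k + 4)
            - Fraction(1, 8*k + 5) - Fraction(1, 8*k + 6))
    return s

approx = float(bbp(12))
```

Twelve terms already agree with Pi to double precision, so a fixed few lines of code generate arbitrarily long prefixes - which is exactly why those prefixes have low Kolmogorov complexity.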

trumpetlicks
jpalecek