35

Recently I came across a nice problem which turned out to be as simple to understand as it is hard to solve. The problem is:

Write a program that reads text from input and prints some other program to output. If we compile and run the printed program, it must output the original text.

The input text is supposed to be rather large (more than 10000 characters).

The only (and very strong) requirement is that the size of the archive (i.e. the printed program) must be strictly less than the size of the original text. This rules out obvious solutions like

std::string s;
/* read the text into s */
std::cout << "#include<iostream> int main () { std::cout<<\"" << s << "\"; }";

I believe some compression techniques need to be used here.

templatetypedef
Grigor Gevorgyan
  • @Tomalak: Actually, no ideas for solution so far, that's why I posted the question – Grigor Gevorgyan Jun 29 '11 at 19:36
  • 1
    What do you mean by "text"? [A-Z][a-z][some-punctuation]*? – Dr. belisarius Jun 29 '11 at 19:38
  • @belisarius: No specification, it can be arbitrary text – Grigor Gevorgyan Jun 29 '11 at 19:40
  • 5
    So you could start with the Library of Congress, run it through this program many times, and wind up with a program of a few lines which could reproduce the entire library. Do you feel one of your eyebrows moving? – Beta Jun 29 '11 at 19:51
  • @Beta: Well, no one said it should be a few lines. The point is it should be smaller than the input text. And I believe it will be much smaller than the Library of Congress – Grigor Gevorgyan Jun 29 '11 at 19:54
  • @Grigor Gevorgyan: you're right, I must amend my comment: you would wind up with a program of *10000 characters* (a few pages) which could reproduce the entire library. – Beta Jun 29 '11 at 20:11
  • @Beta: The texts in Library of Congress, I believe, are not arbitrary, but literally texts, which can be zipped easily – Grigor Gevorgyan Jun 29 '11 at 20:14
  • @Grigor you are missing the point: what happens if you pipe the output of your program back into itself so it gets compressed another time? Eventually it will be impossible to compress any further! This can happen at any file size, so a clever person could create an incompressible input that is 10000 characters long and there is nothing your program can do about it. – Scott Chamberlain Jun 29 '11 at 20:22
  • My answer below has a mathematically rigorous version of this argument. @Beta and @Scott Chamberlain are both correct in that some strings can never be compressed for this very reason. – templatetypedef Jun 29 '11 at 20:24
  • 3
    If you define "Text" as "printable characters on the keyboard" then @Aasmund Eldhuset's solution works fine; however, if "Text" means an array of bytes between the value 0 and 255, this is impossible, as @templatetypedef says, due to your requirement of `the size of the archive (i.e. the program printed) must be strictly less than the size of the original text`, as I can craft a 10000 byte file that is incompressible. – Scott Chamberlain Jun 29 '11 at 20:26
  • This is a variant on the [Quine](http://en.wikipedia.org/wiki/Quine_%28computing%29) concept. – Brian Kelly Jun 29 '11 at 19:37
  • @All: Thanks everyone, especially templatetypedef, for your explanations, this was really useful for me :) – Grigor Gevorgyan Jun 29 '11 at 20:30
  • 1
    This might be a good candidate for code golf. – crazy2be Jun 30 '11 at 03:15

5 Answers

54

Unfortunately, such a program does not exist.

To see why this is so, we need to do a bit of math. First, let's count up how many binary strings there are of length n. Each of the bits can be either a 0 or 1, which gives us one of two choices for each of those bits. Since there are two choices per bit and n bits, there are thus a total of 2^n binary strings of length n.

Now, let's suppose that we want to build a compression algorithm that always compresses a bitstring of length n into a bitstring of length less than n. In order for this to work, we need to count up how many different strings of length less than n there are. Well, this is given by the number of bitstrings of length 0, plus the number of bitstrings of length 1, plus the number of bitstrings of length 2, etc., all the way up to n - 1. This total is

2^0 + 2^1 + 2^2 + ... + 2^(n-1)

Using a bit of math, we can get that this number is equal to 2^n - 1. In other words, the total number of bitstrings of length less than n is one smaller than the number of bitstrings of length n.
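
(For reference, this is just the finite geometric series; the closed form can be written in one line of math:

\sum_{k=0}^{n-1} 2^k \;=\; \frac{2^n - 1}{2 - 1} \;=\; 2^n - 1

Equivalently, add 1 to the sum and the terms collapse pairwise, 1 + 1 = 2, 2 + 2 = 4, and so on, until only 2^n remains, so the sum itself must be 2^n - 1.)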

But this is a problem. In order for us to have a lossless compression algorithm that always maps a string of length n to a string of length at most n - 1, we would have to have some way of associating every bitstring of length n with some shorter bitstring such that no two bitstrings of length n are associated with the same shorter bitstring. This way, we can compress the string by just mapping it to the associated shorter string, and we can decompress it by reversing the mapping. The restriction that no two bitstrings of length n map to the same shorter string is what makes this lossless - if two length-n bitstrings were to map to the same shorter bitstring, then when it came time to decompress the string, there wouldn't be a way to know which of the two original bitstrings we had compressed.

This is where we reach a problem. Since there are 2^n different bitstrings of length n and only 2^n - 1 shorter bitstrings, there is no possible way we can pair up each bitstring of length n with some shorter bitstring without assigning at least two length-n bitstrings to the same shorter string. This means that no matter how hard we try, no matter how clever we are, and no matter how creative we get with our compression algorithm, there is a hard mathematical limit that says that we can't always make the text shorter.

So how does this map to your original problem? Well, if we get a string of text of length at least 10000 and need to output a shorter program that prints it, then we would have to have some way of mapping each of the 2^10000 strings of length 10000 onto the 2^10000 - 1 strings of length less than 10000. That mapping has some other properties, namely that we always have to produce a valid program, but that's irrelevant here - there simply aren't enough shorter strings to go around. As a result, the problem you want to solve is impossible.

That said, we might be able to get a program that can compress all but one of the strings of length 10000 to a shorter string. In fact, we might find a compression algorithm that does this, meaning that with probability 1 - 2^(-10000) a randomly chosen string of length 10000 could be compressed. This is such a high probability that if we kept picking strings for the lifetime of the universe, we'd almost certainly never guess the One Bad String.


For further reading, there is a concept from information theory called Kolmogorov complexity, which is the length of the smallest program necessary to produce a given string. Some strings are easily compressed (for example, abababababababab), while others are not (for example, sdkjhdbvljkhwqe0235089). There exist strings that are called incompressible strings, for which the string cannot possibly be compressed into any smaller space. This means that any program that would print that string would have to be at least as long as the given string. For a good introduction to Kolmogorov complexity, you may want to look at Chapter 6 of "Introduction to the Theory of Computation, Second Edition" by Michael Sipser, which has an excellent overview of some of the cooler results. For a more rigorous and in-depth look, consider reading "Elements of Information Theory" by Cover and Thomas, chapter 14.
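
As a quick empirical illustration of the difference between compressible and (practically) incompressible strings, here is a small sketch; it assumes zlib is installed and the program is linked with -lz. The repetitive buffer shrinks to a handful of bytes, while the pseudo-random one typically comes out slightly larger than it went in.

#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>
#include <zlib.h>

// Compress a buffer with zlib's one-shot compress() and report the output size.
static uLongf compressedSize(const std::string &data) {
    uLongf destLen = compressBound(data.size());
    std::vector<unsigned char> dest(destLen);
    compress(dest.data(), &destLen,
             reinterpret_cast<const Bytef *>(data.data()), data.size());
    return destLen;
}

int main() {
    std::string repetitive(10000, 'a');                       // highly compressible
    std::string random(10000, '\0');
    for (char &c : random) c = (char)(std::rand() & 0xFF);    // close to incompressible

    std::cout << "repetitive: 10000 -> " << compressedSize(repetitive) << " bytes\n";
    std::cout << "random:     10000 -> " << compressedSize(random) << " bytes\n";
}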

Hope this helps!

Mihran Hovsepyan
templatetypedef
  • 2
    Whilst you couldn't guarantee that the program would absolutely _always_ be smaller than the input, you can reasonably assume it for typical inputs... depending upon what those inputs are. That's how ZIPs work: in practice, have you ever seen a ZIP file whose contents are larger compressed than uncompressed? – Lightness Races in Orbit Jun 29 '11 at 19:42
  • 1
    @Jason- Can you explain why that sentence is false? I don't see why not. @Tomalak Geret'kal- You're absolutely correct, but the OP's question is in general unsolvable. For most strings you're definitely correct. – templatetypedef Jun 29 '11 at 19:43
  • 1
    @Jason- Sorry, perhaps I should rephrase that. There exists **at least one string** for which no shorter representation exists. This means that **in general** you can definitely compress strings, but for pathologically bad strings you can't produce a program shorter than the string itself that could print it. – templatetypedef Jun 29 '11 at 19:45
  • @templatetypedef: \*nods\*; @Jason: As templatetypedef explains, mathematically you cannot guarantee that for all inputs. – Lightness Races in Orbit Jun 29 '11 at 19:45
  • 5
    @Jason- The OP does mention (in bold text and with the phrase "very strong requirement") that it has to work for all inputs. The existence of incompressible strings says that you can't possibly do this. I agree with you that in general you can compress most strings, but for the OP's question the existence of a single bad string rules out any possible program to do this. – templatetypedef Jun 29 '11 at 19:48
  • @jason: @templatetypedef's point is that you can't ***ALWAYS*** achieve compression. Can you back the opposite with any reference? – Armen Tsirunyan Jun 29 '11 at 19:49
  • 2
    @templatetypedef: Can you prove (or give a source) that for any N there exists a string longer than N that is not compressible (supposing that the alphabet is less than N symbols)? Because the OP specified that the input is greater than 10000 symbols (whereas the alphabet is no more than 256 symbols) – Armen Tsirunyan Jun 29 '11 at 19:57
  • 6
    @Armen Tsirunyan- Sure! The above pigeonhole argument works for arbitrary N. So there are 2^10000 possible strings of length 10000, but "only" 2^10000 - 1 strings of length shorter than that (and thus there are at most 2^10000 - 1 programs shorter than 10000 characters). Consequently, there's no way that every string of length 10000 can be invertibly mapped to a smaller program, since if it could, at least two strings would have to map to the same program, which couldn't tell which of the two strings to print. – templatetypedef Jun 29 '11 at 20:03
  • @templatetypedef: this is awesomely interesting! :) I hope there's a mistake in your reasoning, but it seems there isn't :) – Armen Tsirunyan Jun 29 '11 at 20:06
  • 2
    @Armen, http://en.wikipedia.org/wiki/Pigeonhole_principle . Whatever the alphabet may be, there exists no complete invertible function from strings of N symbols to strings of less than N symbols. If you restricted the set of symbols allowed to something smaller than those allowed in C strings (i.e., mapped from a string of N symbols from an alphabet of X allowed symbols to a string of X allowed symbols), you might be able to pull it off, but that's going to be rather restrictive. – bdonlan Jun 29 '11 at 20:06
  • @bdonlan- I actually like that idea of increasing the size of the alphabet. I wonder what the notion of "incompressibility" looks like from that standpoint? Could you do something like look at the total number of bits per symbol, then say that while you could compress it in terms of the number of **characters**, you couldn't reduce the number of **bits**? – templatetypedef Jun 29 '11 at 20:29
  • @templatetypedef Such sleight-of-hand doesn't buy you anything particularly useful - the input is still incompressible. – Nick Johnson Jun 30 '11 at 01:32
  • 1
    @Nick: It means that normal text containing words (e.g., human language or programming language) usually _is_ compressible, but binary data is not necessarily so (and compressed data, definitely not!) The key point is that any particular string has a certain amount of Shannon information in it, and that acts as a hard bound on how compressible it is. – Donal Fellows Jul 01 '11 at 07:02
  • Here's a link to two elementary proofs of that summation: http://jeremykun.wordpress.com/2011/07/01/sums-of-k-powers/, and another to an overview of Kolmogorov Complexity: http://jeremykun.wordpress.com/2012/04/21/kolmogorov-complexity-a-primer/. – JeremyKun Apr 21 '12 at 23:23
9

If we are talking about ASCII text...

I think this actually could be done, and I think the restriction that the text will be larger than 10000 chars is there for a reason (to give you coding room).

People here are saying that the string cannot be compressed, yet it can.

Why?

Requirement: OUTPUT THE ORIGINAL TEXT

Text is not arbitrary data. When you read input text you read ASCII chars (bytes), which include both printable and non-printable values.

Take this for example:

ASCII values    characters
0x00 .. 0x08    NUL, (other control codes)                                  
0x09 .. 0x0D    (white-space control codes: '\t','\f','\v','\n','\r')
0x0E .. 0x1F    (other control codes)
... rest of printable characters

Since you have to print text as output, you are not interested in the range (0x00-0x08, 0x0E-0x1F). You can compress the input bytes by using a different storing and retrieving mechanism (binary patterns), since you don't have to give back the original data but the original text. You can recalculate what the stored values mean and readjust them to bytes to print. You would effectively lose only data that was not text data anyway, and is therefore not printable or inputtable. If WinZip did that it would be a big fail, but for your stated requirements it simply does not matter.

Since the requirement states that the text is 10000 chars and you can save 27 of 256 values, if your packing did not have any loss you are effectively saving around 10% space, which means if you can code the 'decompression' in 1000 (10% of 10000) characters you can achieve that. You would have to treat groups of 10 bytes as 11 chars, and from there extrapolate the 11th by some extrapolation method for your range of 229. If that can be done then the problem is solvable.
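
To make the "pack more characters into fewer bytes" idea concrete, here is a minimal sketch under an even stronger assumption than this answer makes: the input is restricted to the 95 printable ASCII characters (0x20-0x7E), for which 95^11 < 2^80, so eleven such characters fit into ten bytes. The 128-bit arithmetic uses GCC/Clang's unsigned __int128 extension, and the function names are made up for illustration.

#include <cassert>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Pack 11 printable-ASCII chars (0x20..0x7E, 95 possible values) into 10 bytes
// by interpreting the block as an 11-digit base-95 number (95^11 < 2^80).
std::vector<uint8_t> packBlock(const std::string &block) {
    assert(block.size() == 11);
    unsigned __int128 value = 0;
    for (int i = 10; i >= 0; --i) {
        assert(block[i] >= 0x20 && block[i] <= 0x7E);
        value = value * 95 + (unsigned char)(block[i] - 0x20);   // block[0] = least significant digit
    }
    std::vector<uint8_t> bytes(10);
    for (int i = 0; i < 10; ++i) { bytes[i] = (uint8_t)(value & 0xFF); value >>= 8; }
    return bytes;
}

std::string unpackBlock(const std::vector<uint8_t> &bytes) {
    unsigned __int128 value = 0;
    for (int i = 9; i >= 0; --i) value = (value << 8) | bytes[i];
    std::string block(11, ' ');
    for (int i = 0; i < 11; ++i) { block[i] = (char)(0x20 + (int)(value % 95)); value /= 95; }
    return block;
}

int main() {
    std::string block = "hello world";                 // exactly 11 printable characters
    std::vector<uint8_t> packed = packBlock(block);
    std::cout << "packed " << block.size() << " chars into " << packed.size() << " bytes\n";
    std::cout << "round trip: " << unpackBlock(packed) << "\n";
}

This sketch saves one byte per eleven input characters; whether a decoder for it can be written compactly enough is exactly the question the answer raises.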

Nevertheless it requires clever thinking and coding skills to actually do that in 1 kilobyte.

Of course this is just a conceptual answer, not a functional one. I don't know if I could ever achieve this.

But I had the urge to give my 2 cents on this, since everybody was so sure it cannot be done.

The real difficulty in your problem is understanding the problem and the requirements.

Marino Šimić
8

What you are describing is essentially a program for creating self-extracting zip archives, with the small difference that a regular self-extracting zip archive would write the original data to a file rather than to stdout. If you want to make such a program yourself, there are plenty of implementations of compression algorithms, or you could implement e.g. DEFLATE (the algorithm used by gzip) yourself. The "outer" program must compress the input data and output the code for the decompression, and embed the compressed data into that code.

Pseudocode:

std::string originalData;
std::getline(std::cin, originalData, '\0');   // read the entire input, not just the first word
std::vector<unsigned char> compressedData = compress(originalData);   // some compression routine
std::cout << "#include<...> std::string decompress(const unsigned char *data, size_t n) { ... }" << std::endl;
std::cout << "int main() { unsigned char compressedData[] = {";
(output the int values of the elements of the compressedData array, separated by commas)
std::cout << "}; std::cout << decompress(compressedData, sizeof compressedData) << std::endl; return 0; }" << std::endl;
Aasmund Eldhuset
0
  1. Assuming "character" means "byte" and assuming the input text may contains at least as many valid characters as the programming language, its impossible to do this for all inputs, since as templatetypedef explained, for any given length of input text all "strictly smaller" programs are themselves possible inputs with smaller length, which means there are more possible inputs than there can ever be outputs. (It's possible to arrange for the output to be at most one bit longer than the input by using an encoding scheme that starts with a "if this is 1, the following is just the unencoded input because it couldn't be compressed further" bit)

  2. Assuming it's sufficient to have this work for most inputs (e.g. inputs that consist mainly of ASCII characters and not the full range of possible byte values), then the answer readily exists: use gzip. That's what it's good at. Nothing is going to be much better. You can either create self-extracting archives, or treat the gzip format as the "language" output. In some circumstances you may be more efficient by having a complete programming language or executable as your output, but often, reducing the overhead by having a format designed for this problem, i.e. gzip, will be more efficient.
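
Here is a sketch of the escape idea from point 1, using a whole flag byte rather than a single bit for simplicity; frame() and the stand-in "compressed" bytes are made up for illustration, and any real compressor could be plugged in:

#include <iostream>
#include <string>
#include <vector>

// Frame the output with one flag byte: 0 means "compressed payload follows",
// 1 means "compression didn't help, the raw input follows". The framed result
// is therefore never more than one byte longer than the original input.
std::vector<unsigned char> frame(const std::string &input,
                                 const std::vector<unsigned char> &compressed) {
    std::vector<unsigned char> out;
    if (compressed.size() < input.size()) {
        out.push_back(0);                                          // flag: compressed
        out.insert(out.end(), compressed.begin(), compressed.end());
    } else {
        out.push_back(1);                                          // flag: stored raw
        out.insert(out.end(), input.begin(), input.end());
    }
    return out;
}

int main() {
    std::string input = "aaaaaaaaaaaaaaaa";                    // 16 bytes of input
    std::vector<unsigned char> fakeCompressed = {'a', 16};     // stand-in for a real compressor's output
    std::cout << "framed size: " << frame(input, fakeCompressed).size() << " bytes\n";
}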

Jack V.
0

It's called a file archiver producing self-extracting archives.

Frigo