
I am tabulating a lot of output from some network analysis, listing one edge per line, which results in dozens of gigabytes, stretching the limits of my resources (understatement). As I only deal with numerical values, it occurred to me that I might do better than the Py3k defaults: some other character encoding might save me quite some space when I only have digits (plus the space and the occasional decimal dot). Constrained as I am, I might even save on the line endings (rather than spending two bytes on the Windows-standard CRLF). What is the best practice on this?

An example line would read like this:

62233 242344 0.42442423

(The last number is actually pointlessly precise; I will cut it back to three significant digits.)

As I will need to read the text file back into other software (Stata, actually), I cannot keep the data in an arbitrary binary format, though I see no reason why Stata would read only UTF-8 text. Or would you simply say that avoiding UTF-8 barely saves me anything?

I suspect compression would not work for me, as I write the text line by line and it would be great to limit the output size even while the file is being generated. I may easily be mistaken about how compression works, but I thought it could only save me space after the file is complete, while my issue is that my code already crashes as I am tabulating the text file (line by line).

Thanks for all the ideas and clarifying questions!

László
  • Can you show some lines of text from your file? – chown Sep 27 '11 at 20:38
  • Is a standard compression algorithm like gzip not good enough? Those usually work quite well when you have only a few distinct characters. – David Z Sep 27 '11 at 20:40
  • If you need to read the file with other software then there is no way to use a character encoding that is less than 8 bits per character. – Mark Ransom Sep 27 '11 at 20:55
  • @MarkRansom: I think one byte per character is still better than what I was doing. But even if I need to extract the file in the end for the other software, for the time being it is great to try compression. Thank you very much! – László Sep 27 '11 at 23:06
  • For all 7-bit ASCII data, UTF-8 has the exact same encoding and size requirements as ASCII. – tzot Oct 15 '11 at 08:43
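
A quick check of that last point (my illustration, not part of the original thread): for 7-bit ASCII text, UTF-8 costs exactly one byte per character, so switching away from UTF-8 saves nothing on digit-only data.

```python
# For 7-bit ASCII data, the UTF-8 encoding is byte-for-byte identical to ASCII.
line = '62233 242344 0.42442423\n'
assert line.encode('utf-8') == line.encode('ascii')
print(len(line), len(line.encode('utf-8')))  # 24 24: one byte per character
```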

4 Answers


You can use zlib or gzip to compress the data as you generate it. You won't need to change your format at all; the compression will adapt to the characters and sequences you use most, producing an optimal file size.
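
A minimal sketch of what that can look like (my illustration, with hypothetical file and variable names), assuming Python 3.3 or later, where `gzip.open` supports text mode:

```python
import gzip

edges = [(62233, 242344, 0.42442423)]  # stand-in for the real edge source

# Text-mode gzip stream: each line is encoded and compressed as it is
# written, so the full uncompressed file never has to exist anywhere.
with gzip.open('edges.txt.gz', 'wt', encoding='ascii', newline='\n') as f:
    for a, b, w in edges:
        f.write('{} {} {:.3g}\n'.format(a, b, w))
```

On Python 3.2, as in the question, `gzip.open` has no text mode; see the sketch after the comment thread below for the `str`-to-`bytes` workaround.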

Mark Ransom
  • @László, please look at the documentation for zlib, particularly `Compress.compress`. It can take your output one string at a time and return a compressed string that you write to your actual file. Gzip uses this to provide a file object that you can write to as if it were an uncompressed file, to make it even simpler. – Mark Ransom Sep 27 '11 at 21:00
  • I tried gzip, but writing to the file line by line raises the following error. I'll continue to research this, but I would be grateful for any quick help too! Traceback (most recent call last): File "parser8.py", line 150, in targets[i].write('0 '+str(H.number_of_nodes())+' SHAPE KEY') File "/n/sw/python-3.2/lib/python3.2/gzip.py", line 312, in write self.crc = zlib.crc32(data, self.crc) & 0xffffffff TypeError: 'str' does not support the buffer interface – László Sep 28 '11 at 14:20
  • @László, I don't know why you're having problems. This works for me in Python 2.7: `f=gzip.GzipFile(r'c:\temp\temp.gz','w')` `f.write('0 '+str(5)+' SHAPE KEY')` `f.close()` – Mark Ransom Sep 28 '11 at 14:40
  • @László this might help: http://stackoverflow.com/questions/2176511/how-do-i-convert-a-string-to-a-buffer-in-python-3-1 – Mark Ransom Sep 28 '11 at 14:46
  • Thanks again. Would you recommend the encode method over the bytes function? I found this: http://stackoverflow.com/questions/5471158/typeerror-str-does-not-support-the-buffer-interface – László Sep 28 '11 at 15:01
  • @László I didn't know which method to recommend so I asked it as a question: http://stackoverflow.com/questions/7585435/best-way-to-convert-string-to-bytes-in-python-3 – Mark Ransom Sep 28 '11 at 15:23
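
A sketch of the fix discussed in this thread (my illustration), assuming Python 3: `GzipFile.write` accepts only `bytes`, so each `str` line must be encoded before writing.

```python
import gzip

# On Python 3, GzipFile.write wants bytes, not str; 'ascii' is enough
# for data made of digits, spaces, dots, and newlines.
with gzip.GzipFile('edges.txt.gz', 'w') as f:
    line = '0 ' + str(5) + ' SHAPE KEY\n'
    f.write(line.encode('ascii'))
```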

Avoid character encodings entirely and save your data in a binary format. See Python's struct module. ASCII-encoded, a value like 4 billion takes 10 bytes, but it fits in a 4-byte integer. There are a lot of downsides to a custom binary format (it's hard to debug manually, to inspect with other tools, etc.).
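
A minimal sketch of the idea (my illustration; the field widths and file name are assumptions): each edge becomes a fixed-size 12-byte record.

```python
import struct

record = struct.Struct('<IIf')  # little-endian: uint32, uint32, float32

# Every edge costs exactly 12 bytes, however many digits it would need as text.
with open('edges.bin', 'wb') as f:
    f.write(record.pack(62233, 242344, 0.424))
```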

P.T.
  • If you apply compression, the resulting size is similar to that of text. It was a surprising finding to me. See my post below. – Wai Yip Tung Sep 27 '11 at 21:34

I have done some study on this. Clever encoding does not matter once you apply compression: even if you use a binary encoding, the data seems to contain the same entropy and ends up at a similar size after compression.

The Power of Gzip

Yes, there are Python libraries that let you stream output and compress it automatically.

Lossy encoding does save space. Cutting down the precision helps.
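
One way to test the claim above (my illustration, not code from the answer) is to gzip the same edges stored as text and as packed binary and compare:

```python
import gzip
import random
import struct

random.seed(0)
edges = [(random.randrange(10**6), random.randrange(10**6), random.random())
         for _ in range(100000)]

text = ''.join('{} {} {:.3g}\n'.format(a, b, w)
               for a, b, w in edges).encode('ascii')
binary = b''.join(struct.pack('<IIf', a, b, w) for a, b, w in edges)

# Compare the compressed sizes of the two representations.
print(len(gzip.compress(text)), len(gzip.compress(binary)))
```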

Wai Yip Tung

I don't know Stata's data-input capabilities, and a quick search reveals that they are described in the User's Guide, which seems to be available only in dead-tree form. So I don't know whether my suggestion is feasible.

An instant saving of half the size would come from using 4 bits per character. Your alphabet is the digits 0 to 9, the period, (possibly) the minus sign, the space and the newline: 14 characters, fitting comfortably in 2**4 == 16 slots.
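
A minimal sketch of such a packer (my illustration; the code assignments are arbitrary):

```python
# Map the 14-symbol alphabet to 4-bit codes, then pack two codes per byte.
ALPHABET = '0123456789.- \n'
CODE = {ch: i for i, ch in enumerate(ALPHABET)}

def pack(text):
    nibbles = [CODE[ch] for ch in text]
    if len(nibbles) % 2:
        nibbles.append(15)  # pad the final byte with an unused code
    return bytes((nibbles[i] << 4) | nibbles[i + 1]
                 for i in range(0, len(nibbles), 2))

print(len(pack('62233 242344 0.424\n')))  # 19 characters -> 10 bytes
```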

If this can be used in Stata, I can help more with suggestions for quick conversions.

tzot