
I compress some data with the lzw module and save it to a file ('wb' mode). The compressed data looks something like this:

'\x18\xc0\x86#\x08$\x0e\x060\x82\xc2`\x90\x98l*'

For small inputs, lzw's compressed strings are in the above format. When I compress bigger strings, the compressed output is split into lines:

'\x18\xc0\x86#\x08$\x0e\x060\x82\xc2`\x90\x98l*', '\xff\xb6\xd9\xe8r4'

As far as I can tell, the compressed string contains '\n' characters, so I think I will lose information if a newline goes missing. How can I store the string so that it is not split and stays on one line?

I have tried this:

for i in s_string:
    testfile.write(i)

-----------------

testfile.write(s_string)

EDIT

def mycpsr(x):
    #x = '11010101001010101010010111110101010101001010' # some random bits for lzw input
    temp = lzw.compress(x)
    temp = "".join(temp)   
    return temp


>>> import lzw
>>> print mycpsr('10101010011111111111111111111111100000000000111111')

If I use a bigger input, let's say x is a string of 0s and 1s with len(x) = 1000, and I append the compressed data to a file, I get multiple lines instead of one.

If the file has this data:

'\t' + normal strings + '\n'
<LZW-strings(with \t\n chars)>
'\t' + normal strings + '\n'

How can I tell which part is lzw data and which is other data?

3 Answers


You are dealing with binary data. If your data contains more than 256 bytes, there is a good chance that some of those bytes equal the ASCII code of '\n'. The result is a binary file that appears to contain more than one line when viewed as a text file.

This is not a problem as long as you treat binary files as a sequence of bytes, not as a sequence of lines.
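
To illustrate the point, here is a minimal sketch (written for Python 3, using a made-up payload and a temporary file rather than the asker's actual data). It suggests that bytes containing '\n' survive a binary write and read unchanged, and only look "split" when the file is read line by line.

```python
import os
import tempfile

# Hypothetical payload standing in for compressed output;
# it deliberately contains two newline bytes.
payload = b"\x18\xc0\x86#\n\x08$\x0e\n\xff\xb6"

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with open(path, "wb") as f:
        f.write(payload)
    with open(path, "rb") as f:
        whole = f.read()        # read everything as one byte string
    with open(path, "rb") as f:
        lines = f.readlines()   # read the same bytes as "lines"
finally:
    os.remove(path)

print(whole == payload)            # True: nothing is lost
print(len(lines))                  # 3: the file merely looks like three lines
print(b"".join(lines) == payload)  # True: the pieces rejoin to the same bytes
```

The newline bytes are part of the data, so joining the pieces back together recovers the original payload.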

Emanuele Paolini
  • I think this is the right answer actually ... although its not very clear ... after re-reading several times I think you are right(+1) ... – Joran Beasley Jul 11 '14 at 18:11

So, your binary data contains newlines, and you want to embed it into a line-oriented document. To do that, you need to quote newlines in the binary data. One way to do it, which will quote not only newlines, but other non-printable characters, is by using base64 encoding:

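import base64, lzw

def my_compress(x):
    # b64encode never inserts line breaks (encodestring wraps its output
    # every 76 characters), so this returns a single line with one
    # trailing \n included
    return base64.b64encode("".join(lzw.compress(x))) + "\n"

def my_decompress(line):
    # strip the trailing newline, decode, then join the decompressed chunks
    return "".join(lzw.decompress(base64.b64decode(line.rstrip("\n"))))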
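
As a rough illustration of how such a record could sit inside a line-oriented file, here is a sketch for Python 3. It assumes `zlib` as a stand-in for the lzw module (which is not in the standard library), and `encode_record`, `decode_record`, and the header and footer text are hypothetical names chosen for the example.

```python
import base64
import zlib

def encode_record(raw: bytes) -> bytes:
    # b64encode emits no line breaks, so the record is exactly one line.
    # zlib is only a stand-in here for whatever compressor is in use.
    return base64.b64encode(zlib.compress(raw)) + b"\n"

def decode_record(line: bytes) -> bytes:
    # Drop the trailing newline before decoding.
    return zlib.decompress(base64.b64decode(line.strip()))

raw = b"10" * 500  # a string of 0s and 1s, 1000 bytes long

# A document with ordinary lines before and after the binary record.
document = b"\theader text\n" + encode_record(raw) + b"\tfooter text\n"

header, record, footer = document.splitlines(keepends=True)
assert decode_record(record) == raw
```

Because the base64 alphabet contains neither '\n' nor '\t', the record cannot be confused with the line or field delimiters of the surrounding data.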

If your code handles binary characters other than newline, you can make the encoding more space-efficient by only replacing newline with r"\n" (backslash followed by n), and backslash with r"\\" (two backslash characters). This will allow lzw data to reside in a single binary line, and you will need to just do the inverse transformation before calling lzw.decompress.
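
The escaping scheme described above might look roughly like the following Python 3 sketch. The function names are hypothetical, and it assumes the file is read and written in binary mode and that '\n' is the only byte that needs protecting (if '\t' also delimits fields, it would need the same treatment).

```python
import re

def escape_newlines(data: bytes) -> bytes:
    # Escape backslashes first, so the backslashes introduced for
    # newlines in the second step are not doubled again.
    return data.replace(b"\\", b"\\\\").replace(b"\n", b"\\n")

# Matches a backslash followed by any single byte (including newline).
_ESCAPE_RE = re.compile(rb"\\(.)", re.DOTALL)

def unescape_newlines(line: bytes) -> bytes:
    # Scan left to right so that an escaped backslash followed by a
    # literal "n" is not mistaken for an escaped newline.
    def _restore(match):
        char = match.group(1)
        return b"\n" if char == b"n" else char
    return _ESCAPE_RE.sub(_restore, line)

payload = b"abc\ndef\\n\\\n" + bytes(range(256))
escaped = escape_newlines(payload)
assert b"\n" not in escaped
assert unescape_newlines(escaped) == payload
```

Two chained `replace` calls would not be a safe inverse, because a literal backslash followed by `n` in the original data would be turned into a newline; the single left-to-right pass avoids that.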

user4815162342
  • Yes you understood what I needed. I will check it right away and let you know – Βασιλης Ιωσηφιδης Jul 11 '14 at 18:51
  • @ΒασιληςΙωσηφιδης Note that your question in the original form was unanswerable. It was only after several edits and comments that I was able to understand your actual question - which is about embedding arbitrary binary data in a line-oriented document. That `lzw.compress` returns more than one string is irrelevant because those strings have nothing to do with newlines, but with the algorithm operating on manageable chunks of data. – user4815162342 Jul 11 '14 at 18:57
  • sorry , im 18h debugging i cant think clearly. i was think as an option to separate the lzw data to another file and with line pointers to do the matching for the normal data. but your alternatives are wayyyyy better. – Βασιλης Ιωσηφιδης Jul 11 '14 at 19:00
  • @ΒασιληςΙωσηφιδης Just remember that base64-encoding will inflate your compressed data by a factor of exactly 4/3 (base64 by definition uses 8 bits to encode 6 bits of binary data). If you want to optimize space, implement the suggestion in the last paragraph. – user4815162342 Jul 11 '14 at 19:02
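
The size overhead mentioned in the last comment can be checked with a short Python 3 sketch. The helper `b64_len` is a hypothetical name; it reflects that base64 turns every group of up to 3 input bytes into 4 output characters, so the ratio is 4/3 only when the input length is a multiple of 3 and is slightly higher otherwise because of padding.

```python
import base64
import math

def b64_len(n: int) -> int:
    # Every group of up to 3 input bytes becomes 4 output characters.
    return 4 * math.ceil(n / 3)

for n in (1, 3, 100, 1000):
    encoded = base64.b64encode(b"\x00" * n)
    # Prints the input size, the actual encoded size, and the predicted size.
    print(n, len(encoded), b64_len(n))
```

For 1000 input bytes this gives 1336 characters, plus one more byte if a trailing newline is appended.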
>>> import lzw
>>> txt = """Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum
 ante velit, adipiscing eget sodales non, faucibus vitae nunc. Praesent ac lorem
 cursus, aliquet magna sed, porta diam. Nunc lorem sapien, euismod in congue non
, tincidunt sit amet arcu. Lorem ipsum dolor sit amet, consectetur adipiscing el
it. Phasellus eleifend bibendum massa, ac convallis tellus sodales in. Suspendis
se non aliquam massa. Aenean erat ipsum, sagittis vitae elementum sit amet, iacu
lis sit amet quam. Vivamus luctus hendrerit libero at fringilla. Nullam id urna
est. Vestibulum pretium et tellus et dictum.
...
... Fusce nulla velit, lobortis at ligula eget, fermentum condimentum felis. Mae
cenas pretium posuere elit in posuere. Suspendisse gravida erat tristique, venen
atis erat at, sagittis elit. Donec laoreet lacinia nunc, eu consequat tortor. Cr
as at sem scelerisque, tristique dolor a, porta mauris. Fusce fermentum massa vi
tae arcu sagittis, et laoreet lacus suscipit. Vestibulum sed accumsan quam. Vest
ibulum eu egestas nisl. Curabitur dolor massa, auctor tempus dui ut, volutpat vu
lputate massa. Fusce vitae tortor adipiscing, gravida est at, molestie tortor. A
enean quis magna magna. Donec cursus enim ac egestas cursus. Pellentesque pulvin
ar nibh in sapien sollicitudin, eget tempus tortor pulvinar. Phasellus dignissim
, urna a sagittis tempor, nulla nulla rhoncus enim, vel molestie nisl lectus qui
s erat. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum sit
amet malesuada nisi, sit amet placerat sem."""
>>>
>>> print "".join(lzw.decompress(lzw.compress(txt)))

This appears to decode it correctly, including the \n characters.

Joran Beasley
  • Yes, if the file consists only of lzw's data. But in my file I write some data before the lzw data and some after. That's why I want lzw's string to be solid, so that when I read it afterwards it will not get mixed up with my other data – Βασιλης Ιωσηφιδης Jul 11 '14 at 18:07
  • I dont understand ... it doesnt matter ... the lzw encoded stuff is simply a string ... as long as you only hand it the lzw encoded string it should properly decode it ... if you are calling decompress on uncompressed bytes then yes you will probably have a problem – Joran Beasley Jul 11 '14 at 18:09
  • @JoranBeasley The OP's actual question is about *embedding* binary data (obtained from `lzw.compress`, but it could as well come from Mars) into a line-oriented file... – user4815162342 Jul 11 '14 at 19:00