Number of lines differ in text and zipped file

Question

I zipped a few files on Unix and later found that the zipped files have a different number of lines than the raw files.

>>wc -l
70308 /location/filename.txt
2931  /location/filename.zip

How's this possible?

Do you mean after unzipping the file you got different line count? — SMA, Dec 23 '14 at 12:56

Gyapti Jain · Answer 1 · 2015-01-31T07:13:25.753

1

zip files are binary files, while the wc command is meant for text files.

The zip-compressed version of a text file may contain more or fewer newline characters than the original, because compression does not work line by line. If both files gave the same output for every command, there would be no point in compressing the file and keeping it in a different format.

From wc man page:

-l, --lines
    print the newline counts

To get the matching output, you should try

$ unzip -c /location/filename.zip | wc -l # Decompress to stdout and count the lines

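This would give about 3 extra lines (if no directory structure is involved), because unzip -c also prints the archive and member names. If you compressed a directory containing the text file instead of just the file, you may see a few more lines containing the file/directory information. With unzip -p instead of -c, only the file data is written to stdout, so the count matches exactly.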
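The effect can be reproduced without the original files. The following is a minimal Python sketch, assuming made-up sample data in place of filename.txt (the record contents and the in-memory archive are illustrative, not the asker's data). It counts newline bytes the way wc -l does: once in the raw text, once in the zip archive read as plain bytes, and once after decompression.

```python
import io
import zipfile


def count_newlines(data: bytes) -> int:
    """Count 0x0A bytes, which is what `wc -l` reports."""
    return data.count(b"\n")


# Hypothetical sample data standing in for filename.txt (70308 lines).
raw = "".join(
    f"record {i}: value {i * 7 % 13}\n" for i in range(70308)
).encode()

# Build a zip archive in memory with the usual DEFLATE method.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("filename.txt", raw)
archive = buf.getvalue()

# Decompress the member again, as `unzip -p` would.
with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    restored = zf.read("filename.txt")

print("raw text:    ", count_newlines(raw))
print("zip as bytes:", count_newlines(archive))
print("after unzip: ", count_newlines(restored))
```

Only the first and last counts are meaningful line counts. The middle one just reports how often the byte 0x0A happens to occur in the compressed stream.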

edited Jan 31 '15 at 07:13

answered Dec 23 '14 at 12:54

Gyapti Jain


I had read somewhere that, in the process of zipping, a dictionary-like object is kept that maps characters to their representations, and a flag is written whenever such a character is found. With that logic, shouldn't the archive contain only one '\n', so that wc -l gives 1? – Kuber Dec 23 '14 at 12:59

Yes, you are describing dictionary-based compression such as the LZW algorithm. zip supports several methods; the usual one is DEFLATE, which combines LZ77 with Huffman coding. The dictionary is built on the fly. The resulting zip is a binary file, and wc does not decompress it but parses the zip file as is. That's why you are getting the count of newline characters in the zip file and not that of the decompressed file. – Gyapti Jain Dec 23 '14 at 13:00

score -1 · Accepted Answer · answered Dec 23 '14 at 15:54

-1

In a compression algorithm, each word or character is replaced by some binary sequence.

Let's suppose \n is replaced by 0011100, and some other character 'x' is replaced by 0001010 (the bit pattern of \n).

The wc program then searches for the sequence 0001010 in the zip file, and the number of matches can differ from the original line count.

answered Dec 23 '14 at 15:54

aibotnet


So you think zip uses 8 bits for the table? `\n` may be compressed to different indices depending on the following characters. For example, `\n`, `\nB` and `\nBut` may be compressed to different binary sequences. Moreover, the first `\n` in a file may be encoded differently from the next occurrence of the same character. – Gyapti Jain Jan 03 '15 at 03:58
I never said zip uses 8 bits for compression; I am talking about compression algorithms in general. Here is a Huffman tree example. Suppose a file contains only the characters a, c, d, x and \n, where a occurs 15 times, c occurs 10 times, d occurs 100 times, x occurs 30 times and \n occurs 5 times. The encoding in binary will then be a->001, c->0001, d->1, x->01, \n->0000. In this case \n is no longer stored as the byte 00001010, so wc -l counts only the bytes that happen to equal 00001010 after the bits are packed, and that number need not match the 5 original newlines. – aibotnet Jan 06 '15 at 13:39
By the way, the .ZIP format uses a 32-bit CRC algorithm; you can check here: http://en.wikipedia.org/wiki/Zip_(file_format) – aibotnet Jan 06 '15 at 14:09

Sorry, no offence, and I did not downvote your answer, but do you even understand what CRC is? CRC, or [cyclic redundancy check](http://stackoverflow.com/questions/2587766/how-is-a-crc32-checksum-calculated), is used to verify the integrity of a file. (If the CRC check fails, the file is corrupted, so `zip` provides greater protection against data loss.) – Gyapti Jain Jan 06 '15 at 15:49

Maybe CRC means something different here; that's why I pasted the Wikipedia link as well. That line is from Wikipedia. You are right: a cyclic redundancy check is like a checksum, and it can't be used for compression. – aibotnet Jan 06 '15 at 16:43
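The Huffman table from the comment thread above (a->001, c->0001, d->1, x->01, \n->0000) can be tried out directly. This is a rough Python sketch, not how zip itself stores data: the `encode` helper and the zero-bit padding at the end are assumptions made for illustration. It packs the code bits into bytes and shows that the number of 0x0A bytes in the result depends on how the bits line up, not on how many newlines the text had.

```python
# Code table taken from the comment thread; it is prefix-free.
CODES = {"a": "001", "c": "0001", "d": "1", "x": "01", "\n": "0000"}


def encode(text: str) -> bytes:
    """Concatenate the code bits and pack them into bytes.

    The last byte is padded with zero bits (an assumption for this sketch).
    """
    bits = "".join(CODES[ch] for ch in text)
    bits += "0" * (-len(bits) % 8)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


for sample in ["\n\n", "d\n\n\n", "\ndxx"]:
    packed = encode(sample)
    print(
        repr(sample),
        "newlines in text:", sample.count("\n"),
        "0x0A bytes after packing:", packed.count(b"\n"),
    )
```

Two of the samples lose all their newlines as far as a byte count is concerned, while in the third the codes for \n, d and x happen to line up into the byte 00001010.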
