I want to open the files inside a .zip file and read them. In this zip file, I have numerous .gz files, like a.dat.gz, b.dat.gz, and so on.

My code so far:

for i in $(unzip -p sample.zip)
do
    for line in $(zcat "$i")
    do
        # do some stuff here
    done
done
Adam Katz
ggupta
  • Please elaborate on "didn't work". – Scott Hunter Aug 24 '17 at 13:36
  • @JohnnyRockex – Your edits changed the question. I had to put them back to understand the nested archive structure this question is asking about. You can see my incorrect assumptions in a [past iteration](https://stackoverflow.com/revisions/45868356/2) of my answer. – Adam Katz Aug 24 '17 at 18:42
  • Just wondering, how @JohnnyRockex edited was confirmed, i didn't do it ;' – ggupta Aug 25 '17 at 05:13
  • Lads, maybe it was a glitch in the matrix, but the edit I did was minimal, and the latest revision still has spelling mistakes :D – Johnny Rockex Aug 25 '17 at 08:16
  • The version Johnny referred to had a minor grammatical error ("file" in place of "files") but no spelling mistakes. I've now fixed that too. @ggupta – Stack Exchange sites use a peer review system for edits from users under 2000 reputation (this is called [edit privilege](https://stackoverflow.com/help/privileges/edit)), requiring (iirc) 2+ users with 2000+ reputation to approve it before it is accepted. Both Johnny and I have this privilege, so the peer review system was bypassed. (So I guess "[the matrix](https://en.wikipedia.org/wiki/The_Matrix)" is in Johnny's head. Lucky guy!) – Adam Katz Aug 25 '17 at 15:33
  • See also: https://superuser.com/questions/384611/ (there's a solution using zipinfo) – user202729 Mar 25 '20 at 10:49

2 Answers

You are correct in needing two loops. First, you need a list of files inside the archive. Then, you need to iterate within each of those files.

unzip -l sample.zip |sed '
  /^ *[0-9][0-9]* *2[0-9-]*  *[0-9][0-9]:[0-9][0-9]  */!d; s///
' |while IFS= read -r file; do
  unzip -p sample.zip "$file" |gunzip -c |while IFS= read -r line; do
    printf '%s\n' "$line"  # placeholder: do stuff to "$line" here
  done
done

This assumes that each file in the zip archive is itself a gzip archive. You'll otherwise get an error from gunzip.

Code walk

unzip -l archive.zip will list the contents. Its raw output looks like this:

Archive:  test.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        9  2017-08-24 13:45   1.txt
        9  2017-08-24 13:45   2.txt
---------                     -------
       18                     2 files

We therefore need to parse it. I've chosen to parse with sed because it's fast, simple, and preserves whitespace properly (what if you have files with tabs in their names?). Note that this will not work if file names contain line breaks. Don't do that.

The sed command uses a regex (explanation here) to match everything on a file-name line except the file name itself. The first command, /regex/!d, deletes every line that does not match (like the title and summary lines). The second command, s///, reuses that regex and replaces the matched text with an empty string, so the output is one file name per line. This gets piped into a while loop as $file. (The IFS= part before read prevents spaces from being stripped from either end, see the comments below.)
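
To see what the two sed commands do to the sample listing above, here is a minimal Python sketch that applies an equivalent regex. The helper name names_from_listing is made up for illustration, and the sketch inherits the same assumption as the sed version: that unzip -l prints dates as YYYY-MM-DD.

```python
import re

# Same idea as the sed regex: size, date, and time columns before the name.
# Assumes the date is printed as YYYY-MM-DD, as in the sample listing.
PREFIX = re.compile(r"^ *[0-9]+ +2[0-9-]* +[0-9]{2}:[0-9]{2} +")

def names_from_listing(listing):
    """Return the file names from `unzip -l` style output (hypothetical helper)."""
    names = []
    for row in listing.splitlines():
        match = PREFIX.match(row)
        if match:                            # like /regex/!d: drop rows that do not match
            names.append(row[match.end():])  # like s///: strip the matched prefix
    return names

sample = """Archive:  test.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        9  2017-08-24 13:45   1.txt
        9  2017-08-24 13:45   2.txt
---------                     -------
       18                     2 files
"""

print(names_from_listing(sample))
```

Only the two rows that carry a size, date, and time survive, and only the name column is kept from each.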

We can then extract just the file we're iterating on, using unzip -p to print it to standard output, where gunzip -c decompresses it and the inner while loop reads it one line at a time as $line.

Experimental simplification

I'm not sure how reliable this would be, but you might be able to do this more simply as:

unzip -p sample.zip |gunzip -c |while IFS= read -r line; do
  printf '%s\n' "$line"  # placeholder: do stuff to "$line"
done

This should work because unzip -p archive spits out the contents of each file in the archive, all concatenated together without any delimiters or metadata (like the file name), and because the gzip format accepts concatenating archives together (see my notes on concatenated archives). The gunzip -c command in the pipeline therefore sees raw gzip data and decompresses it to standard output, which is then passed to the shell's while loop. You will lack file boundaries and names in this approach, but it should be much faster because unzip runs only once rather than once per file.
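
The concatenation property can be checked in isolation. The following is a small Python sketch, not part of the pipeline above: the two byte strings stand in for a.dat.gz and b.dat.gz, and their contents are made up for the demonstration.

```python
import gzip

# Two independent gzip streams, as if from a.dat.gz and b.dat.gz (made-up contents).
first = gzip.compress(b"a: line one\n")
second = gzip.compress(b"b: line one\n")

# Concatenating the compressed bytes mimics what `unzip -p` emits for two members.
combined = first + second

# gzip.decompress accepts multi-member data, much like `gunzip -c` reading the pipe.
lines = gzip.decompress(combined).splitlines()
print(lines)
```

The decompressed result contains the lines of both members in order, with nothing to mark where one file ended and the next began.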

Adam Katz
  • this piece of code `unzip -p sample.zip |gunzip -c |while read line # do stuff to "$line" done`, wonderfully, just wanted to know what actually gunzip -c is doing? Thanks in advance – ggupta Aug 25 '17 at 05:26
  • @ggupta – I've added a bit more explanation to the answer. You described your zip archive as containing multiple gzipped archives, so we need to decompress the zip and then the gzip. `gunzip -c` performs that second decompression. – Adam Katz Aug 25 '17 at 15:26
  • `while read file` doesn't represent all filenames correctly. Names that start or end with whitespace or names that contain literal backslashes are going to be misrepresented. – Charles Duffy Aug 25 '17 at 16:22
  • (Also, the output format of `unzip -l` is not particularly well-specified, making relying on it a questionable choice -- indeed, it's explicitly documented to have an output format other than the one you're assuming when some optional compile-time flags are given; and the documentation doesn't specify whether the date formatting &c. will hold across locales). – Charles Duffy Aug 25 '17 at 16:26
  • (and as yet another corner case, one I almost didn't expect to work, `touch $'name\nwith\nnewline' && zip test.zip $'name\nwith\nnewline'` actually succeeds in creating a file with newlines in its name stored inside a zip archive). – Charles Duffy Aug 25 '17 at 16:44
  • Yeah, I was just wondering about the pkzip format and how well it conforms with oddities like these. I guess it's better than I had expected. I agree that parsing `unzip -l` or `unzip -v` is suboptimal but I don't know of another easy way to do it (without actually unzipping the whole archive to a temporary area). – Adam Katz Aug 25 '17 at 16:47
  • The `while read` limitation could be overcome with `|xargs -d "\n" …` assuming the loop can be adapted in that manner, though that wouldn't resolve the `\n`-in-filename issue. (Without GNU, you would need `|awk '{printf"%s%c",$0,0}' |xargs -0 …`) – Adam Katz Aug 25 '17 at 16:53
  • @AdamKatz, `IFS= read -r line` would be a less-intrusive change to handle the leading-and-trailing whitespace case and literal backslashes. It's more than just newlines that would still be a problem, though -- zip escapes files in the listing in a way different than how the names are literally expected to come out in output (so in the example of newlines in filenames, `unzip -l` shows them as `^J`; I'm sure it has other nonliteral escapes for other nonprintable characters). – Charles Duffy Aug 25 '17 at 17:36
  • (and you're right that this is hard to solve using only native tools, hence my reliance on the python standard-library `zipfile` module). – Charles Duffy Aug 25 '17 at 17:38
  • (that the behavior enabled by the `-r` option to `read` isn't on-by-default is arguably a rather significant design misfeature in the POSIX spec: Without `-r`, for instance, a backslash ending a line of input is an instruction for `read` to store the following line of input in the same variable in the same invocation; whereas `\f`, for instance, is just changed to the literal `f`. Thus, one needs to *explicitly request* that data be passed through unmodified rather than munged, as opposed to the sane, non-data-munging thing being default behavior). – Charles Duffy Aug 25 '17 at 18:05
  • Yeah, it'd be nice if `zip` and `unzip` played by the same rules as `gzip`, `bzip2`, `xz`, etc. I've started and stopped writing wrappers for this compatibility in the past. – Adam Katz Aug 25 '17 at 20:20

This is harder than you might think to do robustly in shell. (The existing answer works in the common case, but archives with surprising filenames included will confuse it). The better option is to use a language with native zip file support -- such as Python. (This can also have the advantage of not needing to open your input file more than once!)

If the individual files are small enough that you can fit a few copies of each in memory, the following will work nicely (the embedded script is Python 2 code, so python must resolve to a Python 2 interpreter):

read_files() {
  python -c '
import sys, zipfile, zlib

zf = zipfile.ZipFile(sys.argv[1], "r")
for content_file in zf.infolist():
    content = zlib.decompress(zf.read(content_file), zlib.MAX_WBITS|32)
    for line in content.split("\n")[:-1]:
        sys.stdout.write("%s\0%s\0" % (content_file.filename, line))
' "$@"
}

while IFS= read -r -d '' filename && IFS= read -r -d '' line; do
  printf 'From file %q, read line: %s\n' "$filename" "$line"
done < <(read_files yourfile.zip)
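
For Python 3, a minimal sketch of the same idea might look like the following. It is an illustration rather than a drop-in replacement: the helper name iter_gz_lines and the demo archive contents are made up, and it assumes every member of the zip is a gzip stream. Unlike the version above, it decompresses lazily instead of holding a whole member in memory.

```python
import gzip
import os
import tempfile
import zipfile

def iter_gz_lines(zip_path):
    """Yield (member_name, line) for every line of every gzipped member.

    Hypothetical helper; assumes each member of the zip is a gzip stream.
    """
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            # zf.open returns a file-like object; gzip.open wraps it and
            # decompresses as lines are requested.
            with zf.open(info) as raw, gzip.open(raw, "rb") as gz:
                for line in gz:
                    yield info.filename, line.rstrip(b"\n")

# Demo: build a small zip of gzipped members in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "sample.zip")
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("a.dat.gz", gzip.compress(b"a1\na2\n"))
        zf.writestr("b.dat.gz", gzip.compress(b"b1\n"))
    results = list(iter_gz_lines(demo_zip))

for name, line in results:
    print(name, line)
```

Lines are kept as bytes so that content is passed through unmodified; decode them only if the encoding of the data is known.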

If you really want to split the file-listing and file-reading operations off from each other, doing that robustly might look like:

### Function: Extract a zip's content list in NUL-delimited form
list_files() {
  python -c '
import sys, zipfile, zlib

zf = zipfile.ZipFile(sys.argv[1], "r")
for content_file in zf.infolist():
    sys.stdout.write("%s\0" % (content_file.filename,))
' "$@"
}

### Function: Extract a single file's contents from a zip file
read_file() {
  python -c '
import sys, zipfile, zlib

zf = zipfile.ZipFile(sys.argv[1], "r")
sys.stdout.write(zf.read(sys.argv[2]))
' "$@"
}

### Main loop
process_zip_contents() {
  local zipfile=$1
  while IFS= read -r -d '' filename; do
    printf 'Started file: %q\n' "$filename"
    while IFS= read -r line; do
      printf '  Read line: %s\n' "$line"
    done < <(read_file "$zipfile" "$filename" | gunzip -c)
  done < <(list_files "$zipfile")
}

To smoketest the above -- if an input file is created as follows:

printf '%s\n' '1: line one' '1: line two' '1: line three' | gzip > one.gz
printf '%s\n' '2: line one' '2: line two' '2: line three' | gzip > two.gz
cp one.gz 'name
with
newline.gz'
zip test.zip one.gz two.gz $'name\nwith\nnewline.gz'
process_zip_contents test.zip

...then we have the following output:

Started file: $'name\nwith\nnewline.gz'
  Read line: 1: line one
  Read line: 1: line two
  Read line: 1: line three
Started file: one.gz
  Read line: 1: line one
  Read line: 1: line two
  Read line: 1: line three
Started file: two.gz
  Read line: 2: line one
  Read line: 2: line two
  Read line: 2: line three
Charles Duffy