I am struggling to work with zipped files, e.g.:

julia> using CodecZlib

julia> text = open("2004_CORE.zip")
IOStream(<file 2004_CORE.zip>)

I have several files of this type to process. Each contains at least one gzipped XML file and one or more plain-text CSV files.

My question is: How do I determine which files and file types are contained in the zip file? And how do I stream those files separately so that I can process the XML files with LightXML and the CSV files with DataFrames?

1 Answer

zlib does not, on its own, process zip files. Note that zip and gzip are two different things. You need something that parses the zip file format. ZipFile may help.
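A minimal sketch of the ZipFile route, assuming the ZipFile.jl and CodecZlib.jl packages are installed. `read_members` is a hypothetical helper name, and treating a `.gz` suffix as the sign of an inner gzip layer is an assumption about how the archive members are named. The whole archive content is held in memory, which may not suit very large files.

```julia
using ZipFile    # third-party package that parses the zip container format
using CodecZlib  # used here only for gzip members nested inside the archive

# Hypothetical helper: return the decoded bytes of every archive member,
# keyed by member name. Members whose name ends in ".gz" are also gunzipped.
function read_members(path::AbstractString)
    contents = Dict{String,Vector{UInt8}}()
    reader = ZipFile.Reader(path)
    try
        for f in reader.files
            endswith(f.name, "/") && continue  # skip directory entries
            data = read(f)  # ZipFile undoes the zip-level (deflate or stored) compression
            if endswith(lowercase(f.name), ".gz")
                data = transcode(GzipDecompressor, data)  # undo the inner gzip layer
            end
            contents[f.name] = data
        end
    finally
        close(reader)
    end
    return contents
end
```

With this, `keys(read_members("2004_CORE.zip"))` would list the member names, and the file extension of each name indicates its type. The byte vectors can then be handed to a parser, for instance `LightXML.parse_string(String(bytes))` for XML members or, assuming CSV.jl is also available, `CSV.read(IOBuffer(bytes), DataFrame)` for CSV members.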

If you don't want to use ZipFile, which is said to be slow, then you will need to pick apart the zip file format yourself. The format is documented in PKWARE's APPNOTE.TXT (https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT). Then you can use the deflate functionality of CodecZlib to decompress the raw deflate data contained in each zip file entry. (Almost all zip files use only the deflate or stored methods.)
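The do-it-yourself route could look roughly like the sketch below, which works on the whole archive read into a byte vector. The names `ZipEntry`, `list_entries`, and `extract` are hypothetical. The sketch assumes a single-disk archive without Zip64 extensions or encryption, does not verify CRCs, and takes the first end-of-central-directory signature found when scanning backward from the end of the file.

```julia
using CodecZlib  # provides DeflateDecompressor for raw deflate data

# Little-endian unsigned integers at 0-based byte offset `off` in `buf`.
u16(buf, off) = UInt16(buf[off + 1]) | (UInt16(buf[off + 2]) << 8)
u32(buf, off) = UInt32(u16(buf, off)) | (UInt32(u16(buf, off + 2)) << 16)

struct ZipEntry
    name::String
    method::Int             # 0 = stored, 8 = deflate
    compressed_size::Int
    uncompressed_size::Int
    header_offset::Int      # 0-based offset of the local file header
end

# Read the central directory and return one ZipEntry per archive member.
function list_entries(buf::Vector{UInt8})
    tail = length(buf) - 22  # the end record is at least 22 bytes long
    eocd = -1
    for off in tail:-1:max(0, tail - 65535)  # an archive comment may follow it
        if u32(buf, off) == 0x06054b50
            eocd = off
            break
        end
    end
    eocd >= 0 || error("no end-of-central-directory record found")
    count = Int(u16(buf, eocd + 10))   # total number of entries
    pos = Int(u32(buf, eocd + 16))     # offset of the central directory
    entries = ZipEntry[]
    for _ in 1:count
        u32(buf, pos) == 0x02014b50 || error("bad central directory header")
        namelen = Int(u16(buf, pos + 28))
        extralen = Int(u16(buf, pos + 30))
        commentlen = Int(u16(buf, pos + 32))
        name = String(buf[pos + 47 : pos + 46 + namelen])
        push!(entries, ZipEntry(name, Int(u16(buf, pos + 10)),
                                Int(u32(buf, pos + 20)), Int(u32(buf, pos + 24)),
                                Int(u32(buf, pos + 42))))
        pos += 46 + namelen + extralen + commentlen
    end
    return entries
end

# Return the uncompressed bytes of one entry.
function extract(buf::Vector{UInt8}, e::ZipEntry)
    off = e.header_offset
    u32(buf, off) == 0x04034b50 || error("bad local file header")
    # Skip the 30-byte fixed header plus the variable-length name and extra field.
    start = off + 30 + Int(u16(buf, off + 26)) + Int(u16(buf, off + 28))
    raw = buf[start + 1 : start + e.compressed_size]
    e.method == 0 && return raw
    e.method == 8 && return transcode(DeflateDecompressor, raw)
    error("unsupported compression method $(e.method)")
end
```

Typical use would be `buf = read("2004_CORE.zip")`, then `list_entries(buf)` to see the names and `extract(buf, entry)` for the content. A member that is itself gzipped would still need a further `transcode(GzipDecompressor, ...)` step.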

Mark Adler
  • Thanks for your effort. I am quite aware of the differences between the file types. CodecZlib can handle both Gzip and Zip files. That is not the point of my question. If you look at https://discourse.julialang.org/t/reading-files-embedded-in-a-zip-file/10675/9 you will see that I was recommended to use CodecZlib and that ZipFile has problems of its own. – Johann Spies May 10 '18 at 06:18
  • This one: https://github.com/bicycle1885/CodecZlib.jl ? If so, it does not handle zip files. Only zlib, gzip, and deflate. The link in your comment says the opposite of what you're saying. It says _"I had hoped to either modernize ZipFiles.jl or to write a CodecZlib-like package for zip archives but haven’t been able to make the time to do so. As I result I have dropped support for zip archives and am recommending that our users use gzip compression instead."_ – Mark Adler May 10 '18 at 07:06
  • From the README of CodecZlib.jl: "This package exports following codecs and streams" (codec / stream): GzipCompressor / GzipCompressorStream, GzipDecompressor / GzipDecompressorStream, ZlibCompressor / ZlibCompressorStream, ZlibDecompressor / ZlibDecompressorStream, DeflateCompressor / DeflateCompressorStream, DeflateDecompressor / DeflateDecompressorStream. – Johann Spies May 10 '18 at 09:55
  • Thanks for the reference to https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT. It is valuable. – Johann Spies May 10 '18 at 10:00
  • @JohannSpies Exactly. None of gzip, zlib, or deflate are the zip format. CodecZlib _does not_ handle zip files. You may find [this answer](https://stackoverflow.com/a/20765054/1180620) useful. – Mark Adler May 10 '18 at 17:02