How can I unzip a .zip file with Python into some directory output_dir and fetch a list of all the directories made by the unzipping as a result? For example, if I have:

unzip('myzip.zip', 'outdir')

outdir is a directory that might already have other files/directories in it. When I unzip myzip.zip into it, I'd like unzip to return all the directories created in outdir/ as a result of the unzipping. Here is my code so far:

import zipfile
def unzip(zip_file, outdir):
    """
    Unzip a given 'zip_file' into the output directory 'outdir'.
    """
    zf = zipfile.ZipFile(zip_file, "r")
    zf.extractall(outdir)

How can I make unzip return the dirs it creates in outdir? thanks.

Edit: the solution that makes the most sense to me is to get ONLY the top-level directories in the zip file and then walk through them recursively, which would guarantee that I get all the files created by the unzipping. Is this possible? The system-specific behavior of namelist makes it virtually impossible to rely on.

  • Actually, if one or more of the directories included in the zip archive already existed locally, you _cannot_ be sure that any given directory will contain only files extracted from the archive. In that case, you need to scan the filesystem _before_ extracting, then _after_ extracting, and calculate the difference. Seems like a lot of work though. – isedev Feb 04 '13 at 23:21
  • any reason why the makers of extractall() wouldn't just have that function return an array of file names that were just created? – mwag Nov 04 '16 at 17:51

4 Answers

9

You can read the contents of the zip file with the namelist() method. Directories will have a trailing path separator:

>>> import zipfile
>>> zip = zipfile.ZipFile('test.zip')
>>> zip.namelist()
['dir2/', 'file1']

You can do this before or after extracting contents.

An earlier version of this answer said that, depending on the operating environment, namelist() may return only the top-level paths of the zip archive. That turned out to be a problem with my own setup (see the comments below).

namelist() returns a complete listing of the zip archive contents, with directories marked with a trailing path separator. For instance, a zip archive of the following file structure:

./file1
./dir2
./dir2/dir21
./dir3
./dir3/file3
./dir3/dir31
./dir3/dir31/file31

results in the following list being returned by zipfile.ZipFile.namelist():

[ 'file1', 
  'dir2/', 
  'dir2/dir21/', 
  'dir3/', 
  'dir3/file3', 
  'dir3/dir31/', 
  'dir3/dir31/file31' ]
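
As a rough sketch of how the trailing separator can be used, the snippet below filters namelist() down to the directory entries. The helper name directory_entries is made up for illustration. It only finds directories stored as explicit entries in the archive; directories that are merely implied by file paths will not show up.

```python
import zipfile


def directory_entries(zip_path):
    """Return the names in the archive that are explicit directory entries."""
    with zipfile.ZipFile(zip_path) as zf:
        # Directory entries are stored with a trailing '/' in the archive.
        return [name for name in zf.namelist() if name.endswith('/')]
```

This reports what the archive contains, not what extraction will newly create, since some of those directories may already exist in the target.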
isedev
  • depends on the implementation, i guess. In IronPython, zip.namelist() shows *all* the files in the archive, not just the top level – David J Feb 04 '13 at 21:52
  • hmm... might be one of those patent/license issues. Just top-level on my Linux environment. If anyone knows a solution to that, I'd be happy to hear it. – isedev Feb 04 '13 at 22:01
  • I'd like to see all the dirs, not just toplevel, too. –  Feb 04 '13 at 22:07
  • something was up with my environment... re-installed zip related packages on Fedora 17 and now getting full paths. Very odd... anyway, my bad, sorry for confusion. – isedev Feb 05 '13 at 01:53
  • This will tell you what's in the zip file, but it doesn't tell you which of those directory names will be created when the file is unzipped -- some of the directories may already exist prior to unzipping. – Larry Lustig Feb 05 '13 at 02:01
1

ZipFile.namelist will return a list of the names of the items in an archive. However, an archive is not required to contain an explicit entry for every directory: many archives list only files, each with its full directory path, so some directories are merely implied by archive member names. To determine the directories created, you need a list of every directory created implicitly by each file.

The dirs_in_zip() function below will do this and collect all dir names into a set.

import zipfile
import os

def parent_dirs(pathname, subdirs=None):
    """Return a set of all individual directories contained in a pathname

    For example, if 'a/b/c.ext' is the path to the file 'c.ext':
    a/b/c.ext -> set(['a','a/b'])
    """
    if subdirs is None:
        subdirs = set()
    parent = os.path.dirname(pathname)
    if parent:
        subdirs.add(parent)
        parent_dirs(parent, subdirs)
    return subdirs


def dirs_in_zip(zf):
    """Return the set of directories that would be created by the ZipFile zf"""
    alldirs = set()
    for fn in zf.namelist():
        alldirs.update(parent_dirs(fn))
    return alldirs


# zipfilename is a placeholder for the path to your archive
zf = zipfile.ZipFile(zipfilename, 'r')

print(dirs_in_zip(zf))
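
To get from "directories in the archive" to "directories the extraction actually created", one option is to check which of those directories are missing from the output directory before extracting. The following is only a sketch of that idea: it reworks the question's unzip() function, assumes nothing else writes to outdir during the call, and returns archive-style relative paths such as 'dir3/dir31'.

```python
import os
import posixpath
import zipfile


def unzip(zip_file, outdir):
    """Extract zip_file into outdir; return the directories that were new."""
    with zipfile.ZipFile(zip_file) as zf:
        candidates = set()
        for name in zf.namelist():
            # Member names use '/' separators; drop any leading '/' so the
            # walk up the path always ends at an empty string.
            parent = posixpath.dirname(name.lstrip('/'))
            while parent:
                candidates.add(parent)
                parent = posixpath.dirname(parent)
        # Record which directories are missing *before* extracting.
        new_dirs = sorted(
            d for d in candidates
            if not os.path.isdir(os.path.join(outdir, *d.split('/')))
        )
        zf.extractall(outdir)
    return new_dirs
```

For the example structure from the other answer, extracting into a directory that already contains dir3 would be expected to report dir2, dir2/dir21 and dir3/dir31, but not dir3.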
Francis Avila
  • Does _not_ return full pathnames on all platforms. See other answer. – isedev Feb 04 '13 at 23:19
  • @isedev, I just tested with Python 2.7.3 on Linux (Ubuntu 12.04.2) and Python 2.7.1 on OS X 10.7.5 and got full pathnames in both cases. I can't see how it's even possible *not* to give full pathnames, since there's only one place in the zipinfo structure where a name may be stored. `namelist()` should be the same as `[zinfo.name for zinfo in zfile.infolist()]` Do you know a specific platform where this is *not* true? – Francis Avila Feb 05 '13 at 00:14
  • something was up with my environment... re-installed `zip` related packages on Fedora 17 and now getting full paths. Very odd... anyway, my bad, sorry for confusion. Just one more thing though: zip can contain directories... try this: 'touch file1; mkdir dir2; zip test.zip *' -> dir2 will be listed in zip. Anyway, thx, will update my answer accordingly. – isedev Feb 05 '13 at 01:43
0

Let it finish and then read the content of the directory - here is a good example of this.

abolotnov
  • But I won't know which files were created by unzipping and which were there already. Keep in mind the output directory could have files in it –  Feb 04 '13 at 21:47
  • fair enough, use namelist() or infolist() to see the archive's content. – abolotnov Feb 04 '13 at 22:11
0

Assuming no one else will be writing to the target directory at the same time, walk the directory recursively prior to unzipping, then again afterwards, and compare the results.
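
A minimal sketch of that before/after comparison, using os.walk to take the snapshots. The function names list_dirs and unzip_and_diff are made up for illustration, and the result is only meaningful under the same assumption that nothing else touches the target directory during extraction.

```python
import os
import zipfile


def list_dirs(root):
    """Return the set of directory paths under root, relative to root."""
    found = set()
    for current, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            found.add(os.path.relpath(os.path.join(current, name), root))
    return found


def unzip_and_diff(zip_file, outdir):
    """Extract zip_file into outdir and return the directories that appeared."""
    # os.walk yields nothing for a missing directory, so this is an empty
    # set when outdir does not exist yet.
    before = list_dirs(outdir)
    with zipfile.ZipFile(zip_file) as zf:
        zf.extractall(outdir)
    return sorted(list_dirs(outdir) - before)
```

The returned paths are relative to outdir and use the platform's own path separator.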

Larry Lustig