You've stated that you need to support "tar, bz2, zip or tar.gz". Python's tarfile module automatically handles gzip- and bzip2-compressed tar files, so there are really only two types of archive that you need to support: tar and zip. (bz2 by itself is not an archive format; it is just compression.)
You can determine whether a given file is a tar file with tarfile.is_tarfile(). This also works on tar files compressed with gzip or bzip2. Within a tar file, you can determine whether a member is a directory using TarInfo.isdir(), or a regular file using TarInfo.isfile().
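As a quick check of that claim, here is a minimal sketch that writes three throwaway archives (plain, gzip and bzip2) plus a bare bz2 file into a temporary directory and asks tarfile.is_tarfile() about each. The file names and the `results` dictionary are made up for the illustration.

```python
import bz2
import os
import tarfile
import tempfile

results = {}
with tempfile.TemporaryDirectory() as tmp:
    # Write the same one-member archive with no compression, gzip and bzip2.
    for suffix, mode in [('.tar', 'w'), ('.tar.gz', 'w:gz'), ('.tar.bz2', 'w:bz2')]:
        path = os.path.join(tmp, 'demo' + suffix)
        with tarfile.open(path, mode) as tar:
            tar.addfile(tarfile.TarInfo('hello.txt'))  # empty regular file
        results[suffix] = tarfile.is_tarfile(path)

    # A bare bz2 stream is compressed data, not an archive.
    path = os.path.join(tmp, 'plain.bz2')
    with bz2.open(path, 'wb') as fh:
        fh.write(b'just some compressed text')
    results['.bz2'] = tarfile.is_tarfile(path)

for suffix, is_tar in results.items():
    print('{} -> {}'.format(suffix, is_tar))
```

The three tar variants should be recognised and the bare bz2 file rejected, which is why a stand-alone .bz2 file needs separate handling (for example with the bz2 module) if you have to accept it at all.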
Similarly, you can determine whether a file is a zip file using zipfile.is_zipfile(). Older versions of zipfile have no method to distinguish directories from normal files (Python 3.6 added ZipInfo.is_dir()), but entries whose names end with / are directories.
So, given a file name, you can do this:
import zipfile
import tarfile

filename = 'test.tgz'

if tarfile.is_tarfile(filename):
    f = tarfile.open(filename)
    for info in f:
        if info.isdir():
            file_type = 'directory'
        elif info.isfile():
            file_type = 'file'
        else:
            file_type = 'unknown'
        print('{} is a {}'.format(info.name, file_type))
elif zipfile.is_zipfile(filename):
    f = zipfile.ZipFile(filename)
    for name in f.namelist():
        print('{} is a {}'.format(name, 'directory' if name.endswith('/') else 'file'))
else:
    print('{} is not an accepted archive file'.format(filename))
When run on a tar file with this structure:
(py2)[mhawke@localhost tmp]$ tar tvfz /tmp/test.tgz
drwxrwxr-x mhawke/mhawke 0 2016-02-29 12:38 x/
lrwxrwxrwx mhawke/mhawke 0 2016-02-29 12:38 x/4 -> 3
drwxrwxr-x mhawke/mhawke 0 2016-02-28 21:14 x/3/
drwxrwxr-x mhawke/mhawke 0 2016-02-28 21:14 x/3/4/
-rw-rw-r-- mhawke/mhawke 0 2016-02-28 21:14 x/3/4/zzz
drwxrwxr-x mhawke/mhawke 0 2016-02-28 21:13 x/2/
-rw-rw-r-- mhawke/mhawke 0 2016-02-28 21:13 x/2/aa
drwxrwxr-x mhawke/mhawke 0 2016-02-28 21:13 x/1/
-rw-rw-r-- mhawke/mhawke 0 2016-02-28 21:13 x/1/abc
-rw-rw-r-- mhawke/mhawke 0 2016-02-28 21:13 x/1/ab
-rw-rw-r-- mhawke/mhawke 0 2016-02-28 21:13 x/1/a
The output is:
x is a directory
x/4 is a unknown
x/3 is a directory
x/3/4 is a directory
x/3/4/zzz is a file
x/2 is a directory
x/2/aa is a file
x/1 is a directory
x/1/abc is a file
x/1/ab is a file
x/1/a is a file
Notice that x/4 is "unknown" because it is a symbolic link.
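If you want to report links rather than lump them in with "unknown", TarInfo also provides issym() and islnk(). The following is a minimal sketch; the helper name `tar_member_type` and the small in-memory archive are made up for the example and only mimic part of the structure shown above.

```python
import io
import tarfile


def tar_member_type(info):
    """Return a short label for a tarfile.TarInfo entry."""
    if info.isdir():
        return 'directory'
    if info.isfile():
        return 'file'
    if info.issym():
        return 'symlink to {}'.format(info.linkname)
    if info.islnk():
        return 'hard link to {}'.format(info.linkname)
    return 'unknown'


# Build a small gzip-compressed archive in memory so the example is self-contained.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w:gz') as tar:
    directory = tarfile.TarInfo('x')
    directory.type = tarfile.DIRTYPE
    tar.addfile(directory)

    link = tarfile.TarInfo('x/4')
    link.type = tarfile.SYMTYPE
    link.linkname = '3'
    tar.addfile(link)

    tar.addfile(tarfile.TarInfo('x/1/a'))  # empty regular file

# Read it back; the default mode detects the compression automatically.
buf.seek(0)
with tarfile.open(fileobj=buf) as tar:
    labels = {info.name: tar_member_type(info) for info in tar}

for name, label in labels.items():
    print('{} is a {}'.format(name, label))
```

Other member types (FIFOs, device nodes) still fall through to 'unknown'; TarInfo has isfifo(), ischr() and isblk() if you need those too.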
There is no easy way, with zipfile, to distinguish a symlink (or other file types) from a directory or normal file. The information is there in the ZipInfo.external_attr attribute, but it's messy to get it back out:
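import stat

# f is the ZipFile opened above; pick the entry to inspect
linked_file = f.filelist[1]
# The upper 16 bits of external_attr hold the Unix file mode
is_symlink = stat.S_ISLNK(linked_file.external_attr >> 16)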
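To make that less messy, the bit shifting can be wrapped in a small helper. This is only a sketch: `zip_member_type` is a hypothetical name, and it assumes the archive was written by a tool that stores Unix mode bits in external_attr (typically archives created on Unix-like systems). Archives from other systems may leave those bits as zero, in which case everything except names ending in / is reported as a file.

```python
import io
import stat
import zipfile


def zip_member_type(info):
    """Classify a zipfile.ZipInfo entry from its Unix mode bits, if present."""
    mode = info.external_attr >> 16  # upper 16 bits: Unix st_mode
    if stat.S_ISLNK(mode):
        return 'symlink'
    if stat.S_ISDIR(mode) or info.filename.endswith('/'):
        return 'directory'
    return 'file'


# Build a small zip in memory so the example is self-contained.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('x/', '')                  # directory entry

    link = zipfile.ZipInfo('x/4')
    link.create_system = 3                 # 3 means the entry was created on Unix
    link.external_attr = (stat.S_IFLNK | 0o777) << 16
    zf.writestr(link, '3')                 # a symlink's data is its target path

    zf.writestr('x/1/a', 'hello')          # regular file

buf.seek(0)
with zipfile.ZipFile(buf) as zf:
    types = {info.filename: zip_member_type(info) for info in zf.infolist()}

for name, kind in types.items():
    print('{} is a {}'.format(name, kind))
```

Checking for the symlink first matters only for odd archives, but it keeps the helper from reporting a link whose name happens to end in / as a directory.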