
I have an archive that I do not want to extract, but I need to check whether each of its entries is a file or a directory.

os.path.isdir and os.path.isfile do not work because I am working with an archive. The archive can be any one of tar, bz2, zip, or tar.gz (so I cannot use a format-specific library). In addition, the code should work on any platform, such as Linux or Windows. Can anybody help me with this?

Sam
  • no, because the archive can be of any type (there are so many types of archives) – Sam Feb 29 '16 at 00:23
  • This should be possible in theory by parsing the file... but why can't you extract them? – timgeb Feb 29 '16 at 00:24
  • I have to upload the archive and extract it then but before uploading, I have to implement a few checks. – Sam Feb 29 '16 at 00:26
  • Maybe one of these gets you started: – timgeb Feb 29 '16 at 00:30
  • [click](http://stackoverflow.com/questions/2018512/reading-tar-file-contents-without-untarring-it-in-python-script) or [click](http://stackoverflow.com/questions/33592099/how-do-i-list-contents-of-a-gz-file-without-extracting-it-in-python) or [click](http://stackoverflow.com/questions/3369732/how-to-see-the-content-of-a-particular-file-in-tar-gz-archive-without-unzipping) – timgeb Feb 29 '16 at 00:30
  • Also of course you could check the filetype and then use the respective library. – timgeb Feb 29 '16 at 00:31
  • There are so many types of archives that it is just not possible to write separate code for each of them. Currently, I call archive.getall_members() and iterate over the result to check whether each entry is a file or a folder, but I do not know how to do that check. – Sam Feb 29 '16 at 00:36
  • "there are so many types of archives that it is just not possible to write separate code for each of them" <- try to apply the strategy pattern. Most of the could would be shared for the archive types, with minimal changes for the different library calls. – timgeb Feb 29 '16 at 00:38
  • I did not get what you are trying to say.. can you please elaborate it? – Sam Feb 29 '16 at 00:42
  • just a typo, I meant "code", not "could". Too late to edit. – timgeb Feb 29 '16 at 06:27

4 Answers


You've stated that you need to support "tar, bz2, zip or tar.gz". Python's tarfile module automatically handles gzip- and bzip2-compressed tar files, so there are really only two types of archive that you need to support: tar and zip. (bz2 by itself is not an archive format; it is just compression.)

You can determine whether a given file is a tar file with tarfile.is_tarfile(). This will also work on tar files compressed with gzip or bzip2 compression. Within a tar file you can determine whether a file is a directory using TarInfo.isdir() or a file with TarInfo.isfile().

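Similarly, you can determine whether a file is a zip file using zipfile.is_zipfile(). Older versions of zipfile have no method to distinguish directories from normal files, but entries whose names end with / are directories. (Python 3.6 and later also provide ZipInfo.is_dir(), which performs the same check.)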

So, given a file name, you can do this:

import zipfile
import tarfile

filename = 'test.tgz'

if tarfile.is_tarfile(filename):
    # tarfile.open() transparently handles gzip- and bzip2-compressed tar files
    with tarfile.open(filename) as f:
        for info in f:
            if info.isdir():
                file_type = 'directory'
            elif info.isfile():
                file_type = 'file'
            else:
                # symlinks, devices, FIFOs, etc.
                file_type = 'unknown'
            print('{} is a {}'.format(info.name, file_type))

elif zipfile.is_zipfile(filename):
    with zipfile.ZipFile(filename) as f:
        for name in f.namelist():
            # zip directory entries have names that end with '/'
            print('{} is a {}'.format(name, 'directory' if name.endswith('/') else 'file'))

else:
    print('{} is not an accepted archive file'.format(filename))

When run on a tar file with this structure:

(py2)[mhawke@localhost tmp]$ tar tvfz /tmp/test.tgz
drwxrwxr-x mhawke/mhawke     0 2016-02-29 12:38 x/
lrwxrwxrwx mhawke/mhawke     0 2016-02-29 12:38 x/4 -> 3
drwxrwxr-x mhawke/mhawke     0 2016-02-28 21:14 x/3/
drwxrwxr-x mhawke/mhawke     0 2016-02-28 21:14 x/3/4/
-rw-rw-r-- mhawke/mhawke     0 2016-02-28 21:14 x/3/4/zzz
drwxrwxr-x mhawke/mhawke     0 2016-02-28 21:13 x/2/
-rw-rw-r-- mhawke/mhawke     0 2016-02-28 21:13 x/2/aa
drwxrwxr-x mhawke/mhawke     0 2016-02-28 21:13 x/1/
-rw-rw-r-- mhawke/mhawke     0 2016-02-28 21:13 x/1/abc
-rw-rw-r-- mhawke/mhawke     0 2016-02-28 21:13 x/1/ab
-rw-rw-r-- mhawke/mhawke     0 2016-02-28 21:13 x/1/a

The output is:

x is a directory
x/4 is a unknown
x/3 is a directory
x/3/4 is a directory
x/3/4/zzz is a file
x/2 is a directory
x/2/aa is a file
x/1 is a directory
x/1/abc is a file
x/1/ab is a file
x/1/a is a file

Notice that x/4 is "unknown" because it is a symbolic link.
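
If symlinks matter for the pre-upload checks, the same loop can report them explicitly. The following is a minimal sketch, not part of the original answer: `classify_members` is a hypothetical helper name, it relies on `TarInfo.issym()` for tar archives, and for zip archives it assumes `ZipInfo.is_dir()` is available (Python 3.6 or later).

```python
import tarfile
import zipfile


def classify_members(path):
    """Return (name, kind) pairs for a tar or zip archive without extracting it.

    Hypothetical helper for illustration. kind is one of
    'directory', 'file', 'symlink' or 'unknown'.
    """
    if tarfile.is_tarfile(path):
        result = []
        # tarfile.open() also handles .tar.gz and .tar.bz2 transparently
        with tarfile.open(path) as archive:
            for info in archive:
                if info.isdir():
                    kind = 'directory'
                elif info.isfile():
                    kind = 'file'
                elif info.issym():
                    kind = 'symlink'
                else:
                    kind = 'unknown'
                result.append((info.name, kind))
        return result

    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            # Assumption: Python 3.6+, where ZipInfo.is_dir() exists.
            # Symlinks inside zip files are reported as plain files here.
            return [
                (info.filename, 'directory' if info.is_dir() else 'file')
                for info in archive.infolist()
            ]

    raise ValueError('{} is not an accepted archive file'.format(path))
```

Calling `classify_members('test.tgz')` on the archive above should then report x/4 as a symlink rather than as unknown, while leaving the other entries unchanged.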

There is no easy way, with zipfile, to distinguish a symlink (or other file types) from a directory or normal file. The information is there in the ZipInfo.external_attr attribute, but it's messy to get it back out:

import stat

linked_file = f.filelist[1]  # the ZipInfo entry to inspect
# The upper 16 bits of external_attr hold the Unix file mode
is_symlink = stat.S_ISLNK(linked_file.external_attr >> 16)
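
To make the snippet above concrete, here is a self-contained sketch that builds a small zip archive in memory and reads the mode bits back. It assumes the entry was written on a Unix-like system (`create_system == 3`), since only then do the upper 16 bits of `external_attr` hold the file mode. `zip_member_is_symlink` is a hypothetical helper name, not part of zipfile.

```python
import io
import stat
import zipfile


def zip_member_is_symlink(info):
    """Return True if a ZipInfo entry records a Unix symlink.

    Assumption: the upper 16 bits of external_attr hold the Unix mode,
    which is only meaningful for entries created on Unix
    (create_system == 3).
    """
    return info.create_system == 3 and stat.S_ISLNK(info.external_attr >> 16)


# Build a tiny archive in memory: one regular file and one symlink entry.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('x/3/zzz', 'data')
    link = zipfile.ZipInfo('x/4')
    link.create_system = 3  # mark the entry as created on Unix
    link.external_attr = (stat.S_IFLNK | 0o777) << 16
    archive.writestr(link, '3')  # a symlink entry stores its target as data

# Read the archive back and classify each entry without extracting it.
with zipfile.ZipFile(buffer) as archive:
    for info in archive.infolist():
        kind = 'symlink' if zip_member_is_symlink(info) else 'other'
        print('{} is a {}'.format(info.filename, kind))
```

Checking `create_system` first is a design choice: archives produced on other systems may leave the upper bits of `external_attr` unset or use them differently, so the mode test alone could be misleading.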
mhawke

I found the answer: we can use two methods, archive.getall_members() and archive.getfile_members().

We iterate over the results of each and store the names in two lists: a1 (file and folder names) and a2 (file names only). If an entry appears in both lists, it is a file; otherwise it is a folder.
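
The comparison described above amounts to a set-membership test. The sketch below is illustrative only: it works on two plain lists of names, because the exact return types of archive.getall_members() and archive.getfile_members() are not shown in this thread, and `split_files_and_folders` is a hypothetical helper name. The sample names are made up for the example.

```python
def split_files_and_folders(all_members, file_members):
    """Split member names into files and folders.

    all_members:  every entry name in the archive (the list called a1 above)
    file_members: names of regular files only (the list called a2 above)
    """
    file_set = set(file_members)  # set lookup keeps the check fast
    files = [name for name in all_members if name in file_set]
    folders = [name for name in all_members if name not in file_set]
    return files, folders


# Made-up sample data standing in for the two library calls.
a1 = ['x/', 'x/1/', 'x/1/a', 'x/2/', 'x/2/aa']
a2 = ['x/1/a', 'x/2/aa']

files, folders = split_files_and_folders(a1, a2)
print(files)
print(folders)
```

Note that anything that is not a regular file, such as a symlink, ends up in the folder list with this approach, so it is only as precise as the library's notion of a "file member".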

Sam

You can use the str.endswith() method to check whether a file name has the expected extension:

filenames = ['code.tar.gz', 'code2.bz2', 'code3.zip']
fileexts = ['.tar.gz', '.bz2', '.zip']

def check_extension():
    for name in filenames:
        for ext in fileexts:
            if name.endswith(ext):
                print ('The file: ', name, ' has the extension: ', ext)


check_extension()

which outputs:

The file:  code.tar.gz  has the extension:  .tar.gz
The file:  code2.bz2  has the extension:  .bz2
The file:  code3.zip  has the extension:  .zip

You would have to build a list of the file extensions for every archive type you want to check against, and load the file names into a list so the check is easy to run, but I think this would be a fairly effective way to solve your issue.

Sean Pianka
  • I am checking the file type of the contents of the archive and not that of the archive itself. Please read my solution. It works. – Sam Feb 29 '16 at 01:40
  • Apologies for my mistake, my interpretation of your question was wrong. – Sean Pianka Feb 29 '16 at 02:40

You can use the python-magic module and parse its output.

[root@jasonralph ~]# yum install python-pip

[root@jasonralph ~]# pip install python-magic

[root@jasonralph ~]# cat py_file_check.py
#!/usr/bin/python

import magic
print(magic.from_file('jason_ralph_org_20160215.tar.gz'))

[root@jasonralph ~]# file jason_ralph_org_20160215.tar.gz
jason_ralph_org_20160215.tar.gz: gzip compressed data, from Unix, last   modified: Mon Feb 29 01:33:25 2016
[root@jasonralph ~]# python py_file_check.py
gzip compressed data, from Unix, last modified: Mon Feb 29 01:33:25 2016
jaysunn
  • I am checking the file type of the contents of the archive and not that of the archive itself. Please read my solution. It works – Sam Feb 29 '16 at 01:55