12

Several Python libraries can extract archive files, such as gzip, zipfile, rarfile, tarfile and patool. I found patool especially useful because it is cross-format: it can extract almost any type of archive, including the most popular ones such as ZIP, GZIP, TAR and RAR.

Extracting an archive file with patool is as easy as this:

import patoolib
patoolib.extract_archive("Archive.zip", outdir="Folder1")

Here "Archive.zip" is the path of the archive file and "Folder1" is the path of the directory where the extracted file will be stored.

The extraction works fine. The problem is that if I run the same code again for the exact same archive file, an identical extracted file is stored in the same folder but with a slightly different name (filename on the first run, filename1 on the second, filename11 on the third, and so on).

Instead, I need the code to overwrite the extracted file if a file with the same name already exists in the directory.

The extract_archive function looks minimal: besides these two parameters it only has a verbosity parameter and a program parameter, which specifies the program you want to extract archives with.

Edit: Nizam Mohamed's answer documented that the extract_archive function actually overwrites the output. I found that to be only partially true: the function overwrites the output for ZIP files, but not for GZ files, which is what I am after. For GZ files, the function still generates new files.

Edit: Padraic Cunningham's answer suggested using the master source. So I downloaded that code and replaced my old patool library scripts with the ones from the link. Here is the result:

os.listdir()
Out[11]: ['a.gz']

patoolib.extract_archive("a.gz",verbosity=1,outdir=".")
patool: Extracting a.gz ...
patool: ... a.gz extracted to `.'.
Out[12]: '.'

patoolib.extract_archive("a.gz",verbosity=1,outdir=".")
patool: Extracting a.gz ...
patool: ... a.gz extracted to `.'.
Out[13]: '.'

patoolib.extract_archive("a.gz",verbosity=1,outdir=".")
patool: Extracting a.gz ...
patool: ... a.gz extracted to `.'.
Out[14]: '.'

os.listdir()
Out[15]: ['a', 'a.gz', 'a1', 'a2']

So, again, the extract_archive function creates a new file every time it is executed. Note that the file stored inside a.gz actually has a different name from a.

multigoodverse

5 Answers

5

As you've stated, patoolib is intended to be a generic archive tool.

Various archive types can be created, extracted, tested, listed, compared, searched and repacked with patool. The advantage of patool is its simplicity in handling archive files without having to remember a myriad of programs and options.

Generic Extract Behaviour vs Specific Extract Behaviour

The problem here is that extract_archive does not expose a way to change the default behaviour of the underlying archive tool.

For a .zip extension, patoolib will use unzip. You can get the desired overwrite behaviour by passing -o on the command line, i.e. unzip -o ... However, this option is specific to unzip, and it differs for each archive utility.

For example, tar offers an overwrite option, but it has no short form like unzip's: tar --overwrite works, but tar -o does not have the intended effect.

To fix this issue you could make a feature request to the author, or use an alternative library. Unfortunately, the generic design of patoolib means every extract utility function would have to be extended to pass the underlying extractor's own overwrite option.
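For the single-file .gz case from the question, the standard library alone may be enough. The following is a minimal sketch, not patoolib's behaviour: the function name gunzip_overwrite and the ".out" fallback suffix are hypothetical, and it assumes a plain gzip file rather than a .tar.gz archive.

```python
import gzip
import os
import shutil


def gunzip_overwrite(archive, outdir="."):
    """Decompress a single .gz file into outdir, replacing any existing output."""
    name = os.path.basename(archive)
    if name.lower().endswith(".gz"):
        name = name[:-3]  # "a.gz" -> "a"
    else:
        name += ".out"  # hypothetical fallback so the archive itself is never overwritten
    os.makedirs(outdir, exist_ok=True)
    target = os.path.join(outdir, name)
    # Opening the target with mode "wb" truncates an existing file,
    # which is what gives the overwrite behaviour.
    with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

Calling gunzip_overwrite("a.gz", "Folder1") repeatedly should leave a single file Folder1/a rather than a, a1, a2.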

Example Changes to patoolib

In patoolib.programs.unzip

def extract_zip (archive, compression, cmd, verbosity, outdir, overwrite=False):
    """Extract a ZIP archive."""
    cmdlist = [cmd]
    if verbosity > 1:
        cmdlist.append('-v')
    if overwrite:
        cmdlist.append('-o')
    cmdlist.extend(['--', archive, '-d', outdir])
    return cmdlist

In patoolib.programs.tar

def extract_tar (archive, compression, cmd, verbosity, outdir, overwrite=False):
    """Extract a TAR archive."""
    cmdlist = [cmd, '--extract']
    if overwrite:
        cmdlist.append('--overwrite')
    add_tar_opts(cmdlist, compression, verbosity)
    cmdlist.extend(["--file", archive, '--directory', outdir])
    return cmdlist

It's not a trivial change to update every program, because each program is different!
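To illustrate how much the options differ, here is a small sketch of a lookup table that adds an overwrite flag to an already built command list. The table and the helper with_overwrite are hypothetical and not part of patoolib; the flags themselves are the programs' own options (tar --overwrite assumes GNU tar).

```python
import os

# Overwrite-related options of a few back-end programs.
# These are the programs' own flags, not patoolib settings.
OVERWRITE_FLAGS = {
    "unzip": ["-o"],         # overwrite existing files without prompting
    "tar": ["--overwrite"],  # GNU tar
    "gzip": ["-f"],          # force overwriting of the output file
}


def with_overwrite(cmdlist):
    """Return a copy of cmdlist with the program's overwrite flag inserted."""
    program = os.path.basename(cmdlist[0])
    flags = OVERWRITE_FLAGS.get(program, [])  # unknown programs are left unchanged
    return [cmdlist[0], *flags, *cmdlist[1:]]
```

A program that is not in the table is returned unchanged, which mirrors the problem described above: every extractor needs its own entry.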

Monkey patching overwrite behavior

So you've decided not to improve the patoolib source code... We can override the behaviour of extract_archive to first look for an existing directory, remove it, and then call the original extract_archive.

You could include this code in your modules; if many modules require it, perhaps put it in __init__.py

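import os
import patoolib
from shutil import rmtree

# Keep a reference to the original function; calling patoolib.extract_archive
# inside the wrapper after patching would recurse forever.
_original_extract_archive = patoolib.extract_archive


def overwrite_then_extract_archive(archive, verbosity=0, outdir=None, program=None):
    # Remove the whole output directory (and everything in it), then recreate it empty.
    if outdir and os.path.exists(outdir):
        rmtree(outdir)
        os.makedirs(outdir)
    return _original_extract_archive(archive, verbosity=verbosity, outdir=outdir, program=program)

patoolib.extract_archive = overwrite_then_extract_archive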

Now when we call extract_archive() we have the functionality of overwrite_then_extract_archive().

Matt Davidson
  • I am not familiar with monkey patching or improving library source codes. Do you mean that if someone improves the source code, I could have instant access and download the improved version of the updated library? – multigoodverse Apr 16 '15 at 19:53
  • Ideally you would contribute the necessary changes to the library yourself. (It's totally understandable if you don't want to do that though!) If you include the last code segment before you use the `extract_archive` function, it will give you the desired overwrite behaviour. Look at the last line `patoolib.extract_archive = overwrite_then_extract_archive` it patches the previous behaviour with the overwrite behaviour. – Matt Davidson Apr 16 '15 at 19:56
  • I thought I'd just add a reference to a general answer regarding the nature of monkey patching: http://stackoverflow.com/questions/5626193/what-is-monkey-patch – OYRM Apr 23 '15 at 14:53
  • Since there are some potentially destructive consequences, I think it's worth pointing out that deleting of the entire output directory as done in proposed implementation of the monkey-patch, isn't quite the same since it also will delete any files in it that were not part of the archive, which isn't quite the same as just overwriting those that are in it. – martineau Jun 28 '15 at 12:44
2

If the functionality doesn't exist, you'll need to add it. An example of this would be to wrap the function with one of your own:

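import os
import patoolib
from shutil import rmtree

def overwriting_extract_archive(zippath, outpath, **kwargs):
    # Start from an empty output directory so nothing from a previous run remains.
    if os.path.exists(outpath):
        rmtree(outpath)
    os.makedirs(outpath)
    return patoolib.extract_archive(zippath, outdir=outpath, **kwargs)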

If you want to check file-by-file and merge new output with existing output, that becomes a more complex problem, of course, but if it's just as you describe (run it a second time), this should work.
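One way to approximate the merge case is to extract into a scratch directory and copy the result over the existing output. This is only a sketch under stated assumptions: extract_and_merge is a hypothetical helper, the extract argument is any callable taking (archive, target_dir), and shutil.copytree with dirs_exist_ok requires Python 3.8 or later.

```python
import shutil
import tempfile


def extract_and_merge(archive, outdir, extract):
    """Extract into a scratch directory, then merge the result into outdir.

    `extract` is any callable taking (archive, target_dir), for example
    lambda a, d: patoolib.extract_archive(a, outdir=d).
    """
    with tempfile.TemporaryDirectory() as scratch:
        extract(archive, scratch)
        # dirs_exist_ok=True (Python 3.8+) copies over files that already
        # exist under the same name and leaves unrelated files untouched.
        shutil.copytree(scratch, outdir, dirs_exist_ok=True)
    return outdir
```

Unlike removing the output directory first, this keeps files in outdir that are not part of the archive, and outdir is only touched after the extraction has succeeded.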

a p
  • I concur. The code is at https://github.com/wummel/patool/blob/c482bbd86192ccd65d5efa4a384bb657150d5347/patoolib/__init__.py#L448 and I was vaguely speculating that perhaps you could monkey-patch the database of command-line parameters for the back-end programs to add an "--overwrite" option to each format you care about (which provides this facility in the first place). The behavior the OP describes doesn't seem to be in the Python code anyway. – tripleee Apr 16 '15 at 18:46
  • @tripleee something like that might work, but monkeypatching is considered less 'Pythonic' than wrapping functions. I tend to think that simpler is better, too, and modifying libraries can come back to bite you later. – a p Apr 16 '15 at 18:49
  • @tripleee I think it's caused by this function: https://github.com/wummel/patool/blob/c482bbd86192ccd65d5efa4a384bb657150d5347/patoolib/util.py#L444 – 1.618 Apr 16 '15 at 18:49
2

Overwriting existing files while extracting an archive may leave the destination directory in an inconsistent state if the extraction fails.

Removing the destination directory before extraction may lead to loss of files if the extraction fails.

I think the best approach is to extract into a temp directory and then sync it to the destination directory.

This solution requires the dirsync module. Note that by default dirsync syncs a file only if its mtime and ctime are newer; it does not compare file sizes.

import os
import sys
from shutil import rmtree
from patoolib import extract_archive
from dirsync import sync

archive = ''  # path of the archive to extract
dst_dir = ''  # directory to update with the extracted files
tmp_dir = None  # set once extraction succeeds, so cleanup knows what to remove

try:
    tmp_dir = extract_archive(archive)
except Exception as e:
    print('extract_archive error {}'.format(e))
    sys.exit(1)
else:
    try:
        sync(tmp_dir, dst_dir, 'sync', options=['modtime'])
    except Exception as e:
        print('updating {} from {} failed, error {}'.format(dst_dir, tmp_dir, e))
        sys.exit(1)
    else:
        sys.exit(0)
finally:
    # tmp_dir is still None when the extraction itself failed.
    if tmp_dir and os.path.exists(tmp_dir):
        rmtree(tmp_dir)
Nizam Mohamed
  • I see - it is working with you because you used a ZIP file. If you pass a GZ file, the function will not overwrite. That's a good discovery, but it still leaves the issue unsolved. – multigoodverse Apr 21 '15 at 21:30
2

With the master source, if you pass a directory using outdir, extraction overwrites existing files, including for .gz archives:

from patoolib import extract_archive

extract_archive("foo.tar.gz",verbosity=1,outdir=".")

You will see:

patool: ... /pathto/.foo.tar.gz extracted to `.'.

The only case where it won't overwrite is when you don't pass a directory; extracting a second time then gives something like:

 ...foo.tar.gz extracted to `foo-1.0.2.tar1' ...(local file exists).

Running from bash, 7z asks each time to confirm the overwrite:

In [9]: ls
foo.gz

In [10]: from patoolib import extract_archive

In [11]: extract_archive("foo.gz",verbosity=1,outdir=".")
patool: Extracting foo.gz ...
patool: running /usr/bin/7z e -o. -- foo.gz

7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=en_IE.UTF-8,Utf16=on,HugeFiles=on,4 CPUs)

Processing archive: foo.gz

Extracting  foo

Everything is Ok

Size:       12
Compressed: 36
patool: ... foo.gz extracted to `.'.
Out[11]: '.'

In [12]: extract_archive("foo.gz",verbosity=1,outdir=".")
patool: Extracting foo.gz ...
patool: running /usr/bin/7z e -o. -- foo.gz

7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=en_IE.UTF-8,Utf16=on,HugeFiles=on,4 CPUs)

Processing archive: foo.gz

file ./foo
already exists. Overwrite with 
foo?
(Y)es / (N)o / (A)lways / (S)kip all / A(u)to rename all / (Q)uit? y
Extracting  foo

Everything is Ok

Size:       12
Compressed: 36
patool: ... foo.gz extracted to `.'.
Out[12]: '.'

In [13]: extract_archive("foo.gz",verbosity=1,outdir=".")
patool: Extracting foo.gz ...
patool: running /usr/bin/7z e -o. -- foo.gz

7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=en_IE.UTF-8,Utf16=on,HugeFiles=on,4 CPUs)

Processing archive: foo.gz

file ./foo
already exists. Overwrite with 
foo?
(Y)es / (N)o / (A)lways / (S)kip all / A(u)to rename all / (Q)uit? y
Extracting  foo

Everything is Ok

Size:       12
Compressed: 36
patool: ... foo.gz extracted to `.'.
Out[13]: '.'

In [14]: ls
foo  foo.gz

Extracting a tar.gz file:

In [1]: from patoolib import extract_archive

In [2]: for x in range(4):
   ...:     extract_archive("/home/padraic/Downloads/pycrypto-2.0.1.tar.gz",verbosity=1,outdir=".")
   ...:
patool: Extracting /home/padraic/Downloads/pycrypto-2.0.1.tar.gz ...
patool: running /bin/tar --extract -z --file /home/padraic/Downloads/pycrypto-2.0.1.tar.gz --directory .
patool: ... /home/padraic/Downloads/pycrypto-2.0.1.tar.gz extracted to `.'.
patool: Extracting /home/padraic/Downloads/pycrypto-2.0.1.tar.gz ...
patool: running /bin/tar --extract -z --file /home/padraic/Downloads/pycrypto-2.0.1.tar.gz --directory .
patool: ... /home/padraic/Downloads/pycrypto-2.0.1.tar.gz extracted to `.'.
patool: Extracting /home/padraic/Downloads/pycrypto-2.0.1.tar.gz ...
patool: running /bin/tar --extract -z --file /home/padraic/Downloads/pycrypto-2.0.1.tar.gz --directory .
patool: ... /home/padraic/Downloads/pycrypto-2.0.1.tar.gz extracted to `.'.
patool: Extracting /home/padraic/Downloads/pycrypto-2.0.1.tar.gz ...
patool: running /bin/tar --extract -z --file /home/padraic/Downloads/pycrypto-2.0.1.tar.gz --directory .
patool: ... /home/padraic/Downloads/pycrypto-2.0.1.tar.gz extracted to `.'.

In [3]: ls
pycrypto-2.0.1/

Again, everything gets overwritten. The only explanation I can see is that whatever application is called by default to unzip your .gz files neither overwrites nor prompts, but creates a new file each time with a slightly changed name.
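To narrow down which back-end program is involved on a given machine, one can check which candidates are installed. This is only a sketch: the candidate list below is an assumption, since the programs patool actually tries for .gz files depend on the installed patool version.

```python
import shutil

# Hypothetical candidate list; the programs patool really tries for .gz
# files depend on the installed patool version.
CANDIDATES = ["7z", "7za", "pigz", "gzip"]


def available_programs(candidates=CANDIDATES):
    """Map each candidate program name to its full path, or None if absent."""
    return {name: shutil.which(name) for name in candidates}


if __name__ == "__main__":
    for name, path in available_programs().items():
        print("{}: {}".format(name, path or "not found"))
```

Running extract_archive with verbosity=1, as in the sessions above, prints a "patool: running ..." line whenever an external program is started, which shows the command that was actually used.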

Padraic Cunningham
1

It seems I have found a workaround for the problem of new files being created every time the extract_archive function of the patool library is executed. It should be emphasized that the function is able to overwrite or skip previously extracted files for other archive extensions, but not for gzipped files.

I noticed that when any gzipped file (.gz) is extracted, the extracted file has the same name as the archive, but without the extension. To illustrate: if you rename X.gz to Y.gz and then extract the archive, the extracted file will be named "Y". Therefore, I was able to implement a simple conditional:

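import os, patoolib

outdir = "C:\\"  # a trailing backslash must be escaped inside a string literal
# Look in the output directory, not the current one, and skip if already extracted.
if "name" not in os.listdir(outdir):
    patoolib.extract_archive("name.gz", outdir=outdir)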

This seems to solve my problem.
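The conditional above skips extraction when the file already exists. If the goal is to overwrite instead, a variation is to delete the previous output first. The helper remove_stale_gz_output below is a hypothetical sketch that relies on the naming observation above (the output is the archive name without .gz).

```python
import os


def remove_stale_gz_output(archive, outdir="."):
    """Delete the file a previous extraction of `archive` left in outdir."""
    name = os.path.basename(archive)
    if not name.lower().endswith(".gz"):
        raise ValueError("expected a .gz archive: {!r}".format(archive))
    target = os.path.join(outdir, name[:-3])  # "name.gz" -> "name"
    if os.path.isfile(target):
        os.remove(target)
    return target
```

Calling remove_stale_gz_output("name.gz", outdir) right before patoolib.extract_archive("name.gz", outdir=outdir) should then leave exactly one fresh copy of the extracted file.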

multigoodverse