6

I want to extract all files with the pattern *_sl_H* from many tar.gz files, without extracting all files from the archives.

I found the following example, but it does not support wildcards (https://pymotw.com/2/tarfile/):

import tarfile
import os

os.mkdir('outdir')
t = tarfile.open('example.tar', 'r')
t.extractall('outdir', members=[t.getmember('README.txt')])
print(os.listdir('outdir'))

Does anyone have an idea? Many thanks in advance.

Martin Evans
asator

2 Answers

13

Take a look at the TarFile.getmembers() method, which returns the members of the archive as a list. Once you have this list, you can use a condition to decide which files to extract.

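import os
import tarfile

# exist_ok avoids an error if 'outdir' already exists
os.makedirs('outdir', exist_ok=True)

# Mode 'r' detects the compression automatically, so this also works for .tar.gz
with tarfile.open('example.tar', 'r') as t:
    for member in t.getmembers():
        # Extract only the members whose name contains the pattern
        if "_sl_H" in member.name:
            t.extract(member, "outdir")

print(os.listdir('outdir'))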
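Since the question asks for the wildcard pattern *_sl_H*, the substring test could also be written with the standard fnmatch module. The following is only a sketch: the function name extract_matching and its parameters are hypothetical, and it assumes the pattern should be matched against the file name only, not against any directory part inside the archive.

```python
import fnmatch
import os
import tarfile


def extract_matching(archive_path, pattern, outdir):
    """Extract regular files whose base name matches a shell-style pattern.

    Hypothetical helper for illustration; returns the extracted member names.
    """
    os.makedirs(outdir, exist_ok=True)
    extracted = []
    # Mode 'r' lets tarfile detect the compression (plain tar, gz, bz2, xz).
    with tarfile.open(archive_path, 'r') as t:
        for member in t.getmembers():
            # Compare only the file name, ignoring directories in the archive.
            base_name = os.path.basename(member.name)
            if member.isfile() and fnmatch.fnmatchcase(base_name, pattern):
                t.extract(member, outdir)
                extracted.append(member.name)
    return extracted
```

A call such as extract_matching('example.tar', '*_sl_H*', 'outdir') would then behave like the loop above, with the wildcard taken directly from the question.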
Alexander
  • +1 for using tarfile, which is really the module to use here. However, I'm not sure about the performance when extracting only some members of a tar.gz: because the whole archive is compressed with gzip, everything probably has to be decompressed in memory in order to reach a given member. – Pierre-Selim Sep 21 '20 at 08:55
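On the performance point, one option worth trying is tarfile's stream mode, which reads the gzip stream once from front to back and lets matching members be extracted as they are encountered. This is a sketch rather than a measured improvement: the function name extract_streaming is hypothetical, and it assumes the archives are gzip-compressed. The archive still has to be decompressed up to each member, but it is processed sequentially instead of being listed first and revisited afterwards.

```python
import os
import tarfile


def extract_streaming(archive_path, needle, outdir):
    """Extract files whose name contains `needle` in one sequential pass.

    Hypothetical helper for illustration; returns the number of files extracted.
    """
    os.makedirs(outdir, exist_ok=True)
    count = 0
    # 'r|gz' opens the archive as a forward-only gzip stream (no seeking).
    with tarfile.open(archive_path, 'r|gz') as t:
        for member in t:
            if member.isfile() and needle in member.name:
                # Extract right away: a stream cannot return to this member later.
                t.extract(member, outdir)
                count += 1
    return count
```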
9

You can extract all files matching your pattern from many tar files as follows:

  1. Use glob to get a list of all of the *.tar or *.gz files in a given folder.

  2. For each tar file, get a list of the files it contains using the getmembers() function.

  3. Use a regular expression (or a simple "xxx" in name test) to filter the required files.

  4. Pass this list of matching files to the members parameter of the extractall() function.

  5. Add exception handling to skip tar files that cannot be opened or read.

For example:

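import glob
import re
import tarfile

# Matches any member name containing "_sl_H"
reT = re.compile(r'_sl_H')

# Change the pattern to *.tar.gz if the archives are gzip-compressed
for tar_filename in glob.glob(r'\my_source_folder\*.tar'):
    try:
        t = tarfile.open(tar_filename, 'r')
    except (OSError, tarfile.TarError) as e:
        # tarfile.ReadError is a TarError, not an IOError, so catch both
        print(e)
    else:
        # The with block closes the archive once extraction is done
        with t:
            t.extractall('outdir', members=[m for m in t.getmembers() if reT.search(m.name)])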
Martin Evans
  • I don't think the library supports deleting files from within a tar archive. To do this you would need to create a new tar file without the extracted files in it. – Martin Evans Mar 09 '16 at 08:48
  • Ok, thanks for the fast reply. So deleting the "active" tar.gz file after extracting the relevant files is not possible? Otherwise I will get storage problems. – asator Mar 09 '16 at 09:10
  • To delete the whole tarfile, try adding `t.close()` after the extract and before an `os.remove()` – Martin Evans Mar 09 '16 at 09:11
  • Perfect, thank you! One more thing: is it possible to extract files with the pattern "_sl_HH" to the output directory, compress all files with the pattern "_sl_HV" into a new archive, and then delete the "active" tar.gz file? Since I am working on a normal PC, what about multicore? I have to process ~2TB in total on an HDD. – asator Mar 09 '16 at 09:28
  • Take a look at [this question](http://stackoverflow.com/questions/2032403/how-to-create-full-compressed-tar-file-using-python) on how to create a tarfile. – Martin Evans Mar 09 '16 at 09:41
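
The workflow discussed in these comments (extract the "_sl_HH" files, repack the "_sl_HV" files into a new archive, then delete the source archive) could be sketched roughly as follows. The function name split_archive and its parameters are hypothetical, and the sketch assumes one new "_sl_HV" archive per source archive; it does not address the multicore question.

```python
import os
import tarfile


def split_archive(archive_path, outdir, hv_archive_path):
    """Extract "_sl_HH" files, repack "_sl_HV" files, then delete the source.

    Hypothetical helper for illustration only.
    """
    os.makedirs(outdir, exist_ok=True)
    with tarfile.open(archive_path, 'r') as src, \
            tarfile.open(hv_archive_path, 'w:gz') as dst:
        for member in src.getmembers():
            if not member.isfile():
                continue
            if '_sl_HH' in member.name:
                src.extract(member, outdir)
            elif '_sl_HV' in member.name:
                # Copy the member straight into the new archive,
                # without writing it to disk first.
                dst.addfile(member, src.extractfile(member))
    # Both archives are closed at this point, so the source can be removed.
    # If anything above raised an exception, this line is never reached.
    os.remove(archive_path)
```

Deleting the source only after both with blocks have exited follows the t.close() before os.remove() advice above.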