
I have to extract hundreds of tar.bz files, each about 5 GB in size. So I tried the following code:

import glob
import tarfile
from multiprocessing import Pool

files = glob.glob('D:\\*.tar.bz') ##All my files are in D
for f in files:

   tar = tarfile.open (f, 'r:bz2')
   pool = Pool(processes=5)

   pool.map(tar.extractall('E:\\') ###I want to extract them in E
   tar.close()

But the code has type error: TypeError: map() takes at least 3 arguments (2 given)

How can I solve it? Any further ideas to accelerate extracting?

Beau
  • I'm betting your problem here is the I/O rather than the code. The `map` error is clear: you have to provide a function and the list of parameters to that function. Your case: `map(extractall, [list, of, files])` – xbello Sep 21 '14 at 15:12
  • How can I provide the destination directory? map(extractall, [list, of, files]) – Beau Sep 21 '14 at 15:16
  • Different targets to each file? `[(list, dest), (of, dest2), (files, dest3)]`. Same target? Create a `functools.partial` for the `extractall`. – xbello Sep 21 '14 at 15:18
  • Actually same target to each file. – Beau Sep 21 '14 at 15:20
  • 1
    possible duplicate of [How can I process a tarfile with a Python multiprocessing pool?](http://stackoverflow.com/questions/8250264/how-can-i-process-a-tarfile-with-a-python-multiprocessing-pool) – Luka Rahne Sep 21 '14 at 15:44
  • Why use python for this at all? If you've got cygwin, you have `xargs -P`, or (shudder) GNU parallel. – Charles Duffy Sep 21 '14 at 17:59

2 Answers


Define a function that extracts a single tar file, then pass that function and the list of tar files to multiprocessing.Pool.map:

from functools import partial
import glob
from multiprocessing import Pool
import tarfile


def extract(path, dest):
    with tarfile.open(path, 'r:bz2') as tar:
        tar.extractall(dest)

if __name__ == '__main__':
    files = glob.glob('D:\\*.tar.bz')
    pool = Pool(processes=5)
    pool.map(partial(extract, dest='E:\\'), files)
falsetru
  • 1
    Also, you could have a look to concurrent.futures.ProcessPoolExecutor() https://docs.python.org/3/library/concurrent.futures.html#processpoolexecutor – DevLounge Sep 21 '14 at 18:01
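A minimal sketch of the alternative that comment suggests, reusing the same `extract` helper as the answer above (`concurrent.futures` is in the standard library from Python 3.2; on 2.x it needs the `futures` backport):

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial
import glob
import tarfile


def extract(path, dest):
    # Open each archive and extract it into the destination directory.
    with tarfile.open(path, 'r:bz2') as tar:
        tar.extractall(dest)


if __name__ == '__main__':
    files = glob.glob('D:\\*.tar.bz')
    with ProcessPoolExecutor(max_workers=5) as executor:
        # executor.map behaves like Pool.map; wrapping it in list()
        # forces all tasks to complete before the pool shuts down.
        list(executor.map(partial(extract, dest='E:\\'), files))
```

As with `Pool.map`, the function passed to `executor.map` must be picklable, which is why `functools.partial` of a top-level function is used rather than a lambda.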

You need to change pool.map(tar.extractall('E:\\') to something like pool.map(extract_function, list_of_all_files): pass the function itself, not the result of calling it.

Note that map() takes 2 arguments: the first is a function and the second is an iterable. It applies the function to every item of the iterable and returns a list of the results.
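To illustrate that signature with a toy example (the `square` function here is hypothetical, just to show the pattern):

```python
from multiprocessing import Pool


def square(x):
    # The worker function must be defined at module top level
    # so it can be pickled and sent to the child processes.
    return x * x


if __name__ == '__main__':
    with Pool(processes=2) as pool:
        # map applies square to each item and returns the results in order
        print(pool.map(square, [1, 2, 3]))  # [1, 4, 9]
```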

Edit: you need a function that each worker process can call on one archive file name:

def test_multiproc():
    files = glob.glob('D:\\*.tar.bz2')
    pool = Pool(processes=5)
    result = pool.map(read_files, files)


def read_files(name):
    t = tarfile.open(name, 'r:bz2')
    t.extractall('E:\\')
    t.close()

>>> test_multiproc()
Mazdak