
I am using 64-bit Python 3.6.3 on a 64-bit Windows 10 laptop with 12 GB of RAM.

I have Python code that extracts a gzipped tarball (tar.gz, not actually a zip file). If I use the code, it takes a really long time (~1.5 hours), but if I extract the archive directly with 7-Zip it takes less than 5 minutes, so I am guessing something is impeding Python's performance.

I am trying to run this code https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/udacity/1_notmnist.ipynb

For convenience, here are the specific commands for extraction:

import tarfile
import sys

tar = tarfile.open(filename)  # default mode "r" handles gzip transparently
sys.stdout.flush()
tar.extractall(data_root)
tar.close()
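For measuring how long the extraction actually takes, the snippet above can be wrapped in a small timing helper (a sketch; `timed_extract` is not part of the original code, and the archive path and destination are placeholders for your own):

```python
import sys
import tarfile
import time

def timed_extract(filename, data_root="."):
    """Extract a (possibly gzipped) tar archive and report elapsed wall-clock time."""
    start = time.time()
    with tarfile.open(filename) as tar:  # default mode "r" handles .tar.gz transparently
        tar.extractall(data_root)
    sys.stdout.flush()
    print("Extracted %s in %.1f s" % (filename, time.time() - start))
```

The context manager also guarantees the archive is closed even if extraction raises.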

Here is the full code:

from __future__ import print_function
import os
import sys
import tarfile
from six.moves.urllib.request import urlretrieve


# Config the matplotlib backend as plotting inline in IPython

url = 'https://commondatastorage.googleapis.com/books1000/'
last_percent_reported = None
data_root = '.'  # Change me to store data elsewhere


def download_progress_hook(count, blockSize, totalSize):
    """A hook to report the progress of a download. This is mostly intended for users with
    slow internet connections. Reports every 5% change in download progress.
    """
    global last_percent_reported
    percent = int(count * blockSize * 100 / totalSize)

    if last_percent_reported != percent:
        if percent % 5 == 0:
            sys.stdout.write("%s%%" % percent)
            sys.stdout.flush()
        else:
            sys.stdout.write(".")
            sys.stdout.flush()

        last_percent_reported = percent


def maybe_download(filename, expected_bytes, force=False):
    """Download a file if not present, and make sure it's the right size."""
    dest_filename = os.path.join(data_root, filename)
    if force or not os.path.exists(dest_filename):
        print('Attempting to download:', filename)
        filename, _ = urlretrieve(url + filename, dest_filename, reporthook=download_progress_hook)
        print('\nDownload Complete!')
    statinfo = os.stat(dest_filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified', dest_filename)
    else:
        raise Exception(
            'Failed to verify ' + dest_filename + '. Can you get to it with a browser?')
    return dest_filename


train_filename = maybe_download('notMNIST_large.tar.gz', 247336696)
test_filename = maybe_download('notMNIST_small.tar.gz', 8458043)

num_classes = 10

def maybe_extract(filename, force=False):
    root = os.path.splitext(os.path.splitext(filename)[0])[0]  # remove .tar.gz
    if os.path.isdir(root) and not force:
        # You may override by setting force=True.
        print('%s already present - Skipping extraction of %s.' % (root, filename))
    else:
        print('Extracting data for %s. This may take a while. Please wait.' % root)
        tar = tarfile.open(filename)
        sys.stdout.flush()
        tar.extractall(data_root)
        tar.close()
    data_folders = [
        os.path.join(root, d) for d in sorted(os.listdir(root))
        if os.path.isdir(os.path.join(root, d))]
    if len(data_folders) != num_classes:
        raise Exception(
            'Expected %d folders, one per class. Found %d instead.' % (
                num_classes, len(data_folders)))
    print(data_folders)
    return data_folders

train_folders = maybe_extract(train_filename)
test_folders = maybe_extract(test_filename)


SantoshGupta7
  • I cannot reproduce this problem. Please consider creating a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve) – Elis Byberi Nov 20 '17 at 13:07
  • Does `tarfile` automatically unzip too? – Mad Physicist Nov 20 '17 at 13:07
  • What is `data_root`? – lxop Nov 20 '17 at 13:07
  • @ElisByberi I added a code sample which only includes the most important parts. I removed the calls for numpy, scipy, IPython, etc. Try now. – SantoshGupta7 Nov 20 '17 at 13:18
  • @MadPhysicist yes, if I run the python code, it unzips the file. It just takes a really long time. – SantoshGupta7 Nov 20 '17 at 13:18
  • Possible duplicate of [Tarfile in Python: Can I untar more efficiently by extracting only some of the data?](https://stackoverflow.com/questions/26067471/tarfile-in-python-can-i-untar-more-efficiently-by-extracting-only-some-of-the-d) – Elis Byberi Nov 20 '17 at 13:20
  • @lxop it's a placeholder to store the files somewhere. Please see the updated post, where I include all the code you need to run and test it. – SantoshGupta7 Nov 20 '17 at 13:21
  • @ElisByberi no, I am looking to extract all the files, not just some of them. My Python code runs much slower (1 hour vs. 5 minutes) than unzipping them directly, so I think something might be impeding the Python interpreter. – SantoshGupta7 Nov 20 '17 at 13:25
  • @MadPhysicist `tarfile` does not unzip without calling `extractall()` method. – Elis Byberi Nov 20 '17 at 15:42

1 Answer


The tarfile module is implemented in pure Python; 7-Zip is implemented in C++. Going by the timings in the question (roughly 60–90 minutes vs. ~5 minutes), tarfile is about 12–18 times slower than 7-Zip here.

Extracting that many files is normally slow. To be honest, tarfile is doing a pretty good job: there are over 500,000 files to extract.
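As a quick sanity check (a sketch, not part of the original answer), you can split the job in two: decompress the gzip layer to a plain `.tar` first, then extract that. If the second step dominates, the bottleneck is creating hundreds of thousands of small files on disk (and, on Windows, any on-access antivirus scanning) rather than Python's decompression:

```python
import gzip
import shutil
import tarfile
import time

def two_step_extract(gz_path, tar_path, dest="."):
    """Gunzip first, then untar, timing each step separately so you can
    see whether decompression or file creation dominates."""
    t0 = time.time()
    # Step 1: strip the gzip layer into a plain .tar file.
    with gzip.open(gz_path, "rb") as src, open(tar_path, "wb") as dst:
        shutil.copyfileobj(src, dst, length=1024 * 1024)  # 1 MiB chunks
    t1 = time.time()
    # Step 2: extract the uncompressed tar ("r:" forbids compression).
    with tarfile.open(tar_path, "r:") as tar:
        tar.extractall(dest)
    t2 = time.time()
    print("gunzip: %.2f s, untar: %.2f s" % (t1 - t0, t2 - t1))
```

`two_step_extract` is a hypothetical helper name; the APIs used (`gzip.open`, `shutil.copyfileobj`, `tarfile.open`) are all standard library.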

Elis Byberi
  • For me it takes a few seconds to create an archive from files but much longer to extract them. Extracting a file is extremely simple: read the offset and size, move the file pointer, and copy the data from one file to another. It really should not matter whether it is implemented in C or Python, as the bottleneck is supposed to be the disk operations. – user2555515 Dec 23 '21 at 00:36
  • @user2555515 Quoting OP "If I use the code, it takes a really long time (~1.5 hour) but if I unzip it directly using 7zip it takes less than 5 minutes." If 7zip takes 5 minutes, it is the tarfile Python module at fault. There are many questions like this one, e.g. https://stackoverflow.com/a/45621522/2430448 – Elis Byberi Dec 23 '21 at 01:48