I have to loop through 30 zip folders, and each zip folder has 50,000 - 90,000 jpeg files. Ideally, I would loop through each zip folder directly, because unzipping each one would take too long. For each file, I need to open it, extract key information, and store that information in a list. Based on How to do multithreading on a folder with several files?, I tried enabling multiprocessing to make things quicker; however, I can't figure it out. In my example below, I am trying to get it to work with one folder for the moment, and then I will need to figure out how to make it loop through all 30 zip folders.

import os
from zipfile import ZipFile

data_list = []
 
def image_processor(file):
    with ZipFile("files101.zip") as zip_file:
        with zip_file.open(file, "r") as img_file:
            img_data = img_file.readlines(1) # data is available in beginning of each file
            
            # Extract data #1
            pattern_1 = r'IMG:\d{,3}'
            if re.findall(pattern_1, str(img_data)):
                img_extract = re.findall(pattern_1, str(img_data))[0]
            else:
                img_extract = np.nan

            # Extract timestamp
            time_pattern = r'Time:\s\d{2}-\d{2}-\d{4}\s\s\d{2}:\d{2}:\d{2}'
            if re.findall(time_pattern, str(img_data)):
                time_extract = re.findall(time_pattern, str(img_data))[0]
            else:
                time_extract = np.nan

            # Create list   
            return data_list.append([img_extract, time_extract])

os.chdir(r"C:\\Users\\xxxxxx\\Desktop\\zip")
for folder in os.listdir():
    file_list = ZipFile("files101.zip", "r").namelist()

    with ProcessPool(processes=8) as pool:
        pool.map(image_processor, file_list)

What happens is that my code just runs forever, exactly as it does without multiprocessing. If multithreading is the better fit, I have six cores. Any advice would be appreciated.

DataNoob7

1 Answer

You are missing several imports. But the major items I notice right away are:

  1. for folder in os.listdir(): You are looping over every file and directory in the current directory, yet the loop body makes no reference to any of them; you are processing files101.zip repeatedly.
  2. You appear to be running under Windows, based on your chdir command. Code that creates new processes must be within an if __name__ == '__main__': block.
  3. Each process in a processing pool has its own address space, so each process would be appending to a different instance of data_list. Returning data_list back to the main process and having the main process append all the return values to a master list would work, but you would have to ensure that your image_processor function starts out with an empty data_list on every call (see the sketch just below).
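
To illustrate point 3, here is a minimal, self-contained sketch (with a hypothetical worker function and toy values, not your actual processing) showing that appends made inside worker processes never reach the parent's list; only return values make it back:

from multiprocessing import Pool

data_list = []  # each worker process gets its own copy of this global

def worker(n):
    data_list.append(n)  # mutates the worker's private copy only
    return n * n         # returning the value is how data gets back

if __name__ == '__main__':
    with Pool(2) as pool:
        print(pool.map(worker, range(4)))  # [0, 1, 4, 9]
    print(data_list)  # [] -- the parent's list was never touched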

I am assuming you want to process all the files with extension .zip in the current directory (of which there are approximately 30). I would modify your processing so that the unit of work submitted to the processing pool is not a single file within a zip archive (in which case you would be submitting 30 * 50,000 tasks) but rather a whole archive. So your main processing function is no longer image_processor but rather zip_processor. I have made a few other changes to this file that make sense to me (I hope I did not break anything):

import os
import glob
from zipfile import ZipFile
import re
import numpy as np
from multiprocessing import Pool

def zip_processor(zipfile):
    with ZipFile(zipfile) as zip_file:
        data_list = []
        for file in zip_file.namelist():
            with zip_file.open(file, "r") as img_file:
                # take element 0 from returned list and convert to a string
                img_data = img_file.readlines(1)[0].decode() # data is available in beginning of each file
                # Extract data #1
                pattern_1 = r'IMG:\d{,3}'
                # why do findall if you are only using the first occurrence?
                m = re.search(pattern_1, img_data)
                img_extract = m.group(0) if m else np.nan
    
                # Extract timestamp
                time_pattern = r'Time:\s\d{2}-\d{2}-\d{4}\s\s\d{2}:\d{2}:\d{2}'
                m = re.search(time_pattern, img_data)
                time_extract = m.group(0) if m else np.nan
    
                # Create list   
                data_list.append([img_extract, time_extract])
        return data_list
            

# required for Windows:
if __name__ == '__main__':
    os.chdir(r"C:\Users\xxxxxx\Desktop\zip")

    # Default pool size:
    with Pool() as pool:
        results = pool.imap(zip_processor, glob.iglob('*.zip'))
        data_list = []
        for result in results:
            data_list.extend(result)

Now since there is a lot of I/O involved, this might run very well using multithreading, in which case a larger pool size would be advantageous. Make the following changes:

#from multiprocessing import Pool
from multiprocessing.pool import ThreadPool
... # etc.

if __name__ == '__main__':
    os.chdir(r"C:\Users\xxxxxx\Desktop\zip")

    zip_list = glob.glob('*.zip')
    with ThreadPool(len(zip_list)) as pool:
        results = pool.imap(zip_processor, zip_list)
        ... # etc.
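
Assembled from the pieces above, the complete multithreaded version would look something like this (zip_processor unchanged from the first example):

import os
import glob
from multiprocessing.pool import ThreadPool

# zip_processor is the same function defined in the multiprocessing example.

if __name__ == '__main__':
    os.chdir(r"C:\Users\xxxxxx\Desktop\zip")

    # One thread per archive; threads share the process's memory, so
    # collecting the returned lists works exactly as before.
    zip_list = glob.glob('*.zip')
    with ThreadPool(len(zip_list)) as pool:
        results = pool.imap(zip_processor, zip_list)
        data_list = []
        for result in results:
            data_list.extend(result)
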
Booboo
  • Hi Booboo, thank you for your feedback; I will go through it and report back. One question I have regarding `img_file.readlines(1)[0].decode()`: in my tests, I was able to use `img_file.readlines(1)` and then obtain the necessary data. Why is `[0].decode()` necessary? – DataNoob7 Jul 10 '21 at 17:59
  • You must be running Python 2, then. On Python 3, `img_file.readlines(1)[0]` would return a *byte string*, not a character string (in Python 2, there is no distinction between `'abc'` and `b'abc'`, but that is not the case for Python 3). And so when you do `re.search(pattern_1, img_data)` where `pattern_1` is class `str` and `img_data` is class `bytes`, you would get an error. Calling `decode` converts a bytes string to a unicode string using the current default encoding (usually utf-8). If you are on Python 2, omit the call to `decode()` if you don't want *unicode* results for your matches. – Booboo Jul 10 '21 at 18:42
  • The reason why I have `img_file.readlines(1)[0]` instead of `img_file.readlines(1)` is that `readlines` returns a `list` and I am only interested in doing a pattern match against the first element. Your code converted the list to a string, so you were doing a pattern match against that converted string, which began with '[' and ended with ']'. So if the first element had value `'abc'`, I would be doing a pattern match just against the string `'abc'`, but you would be doing a pattern match against the string `"['abc']"`, because `str(['abc'])` == `"['abc']"`. It's silly even if it works. – Booboo Jul 10 '21 at 18:58
  • Hi Booboo, I implemented your code. I am using Python 3. I get an error when I try to use `.decode`. It's fine, as `img_file.readlines(1)[0]` works as per your comment above. I have been testing with three zip folders. The multiprocessing seems to run infinitely, unfortunately. The multithreading code works, but it is slower than applying a for loop to extract the data. I am told there will be 300+ zip files (50-90k images each) instead of the original 30, so multi-threading would be very nice to speed this up. – DataNoob7 Jul 12 '21 at 14:57
  • Don’t call decode, but then I would think you would need to use a byte string as your regex pattern (see the sketch after these comments). And I do think that your problem is better suited for multithreading, as I indicated in the second code example. – Booboo Jul 12 '21 at 18:37
  • I agree about the multithreading, but when I implement it, it doesn't make anything faster. It actually makes it slower than a simple for loop. I added this to your multithreading example (essentially the last part of the multiprocessing example): `data_list = []; for result in results: data_list.extend(result)` – DataNoob7 Jul 12 '21 at 20:52
  • Unfortunately, I am away for a few days with no computer access other than my phone and won’t be able to look at this until I get back. – Booboo Jul 12 '21 at 20:57
  • That's fine, any help will be appreciated when you get back. As an aside - I noticed when I tried the multiprocessing method, I get `AttributeError: Can't get attribute 'zip_processor' on <module '__main__' ...>` in the anaconda prompt (didn't see it previously). Apparently it is a built-in issue, and the workaround is to put zip_processor into a .py file and import it. Now I am getting `cannot pickle "module" object`. I will go back to trying to figure out why the multithreading example runs slower than a regular for loop, as that at least runs. – DataNoob7 Jul 13 '21 at 13:48
  • I got the multiprocessing to work - https://stackoverflow.com/questions/41385708/multiprocessing-example-giving-attributeerror - see my response to the second answer in case you are curious. I am still interested in multithreading, but it seems my problem was better suited for multiprocessing instead. I appreciate your help Booboo! – DataNoob7 Jul 14 '21 at 17:58
  • So you must have been running under something like Jupyter Notebook or IPython or interactive Python. Multithreading will not be performant if there is too much contention for the Global Interpreter Lock. – Booboo Jul 14 '21 at 20:20
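
A minimal sketch of the byte-pattern approach mentioned in the comments above (the archive name and field patterns are the illustrative ones from the question, not a confirmed file format):

import re
from zipfile import ZipFile

# Byte-string patterns (rb'...') match directly against the raw bytes
# that readlines() returns, so no decode() of the whole line is needed.
img_pattern = re.compile(rb'IMG:\d{,3}')
time_pattern = re.compile(rb'Time:\s\d{2}-\d{2}-\d{4}\s\s\d{2}:\d{2}:\d{2}')

with ZipFile("files101.zip") as zip_file:
    for name in zip_file.namelist():
        with zip_file.open(name, "r") as img_file:
            first_line = img_file.readlines(1)[0]  # bytes, not str
            m = img_pattern.search(first_line)
            img_extract = m.group(0).decode() if m else None
            m = time_pattern.search(first_line)
            time_extract = m.group(0).decode() if m else None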