I have to loop through 30 zip folders, each containing 50,000 to 90,000 JPEG files. Ideally, I would read the files directly from each zip folder rather than extracting it first, because unzipping each folder would take too long. For each file, I need to open it, extract key information from it, and store that information in a list. Based on "How to do multithreading on a folder with several files?", I tried using multiprocessing to speed things up, but I can't figure it out. In the example below, I am trying to get it to work with a single zip folder for now; afterwards I will need to figure out how to loop through all 30.
import os
import re
from zipfile import ZipFile

import numpy as np
# Assumption: ProcessPool is multiprocessing.Pool under another name
from multiprocessing import Pool as ProcessPool

data_list = []

def image_processor(file):
    with ZipFile("files101.zip") as zip_file:
        with zip_file.open(file, "r") as img_file:
            img_data = img_file.readlines(1)  # data is available in beginning of each file

    # Extract data #1
    pattern_1 = r'IMG:\d{,3}'
    if re.findall(pattern_1, str(img_data)):
        img_extract = re.findall(pattern_1, str(img_data))[0]
    else:
        img_extract = np.nan

    # Extract timestamp
    time_pattern = r'Time:\s\d{2}-\d{2}-\d{4}\s\s\d{2}:\d{2}:\d{2}'
    if re.findall(time_pattern, str(img_data)):
        time_extract = re.findall(time_pattern, str(img_data))[0]
    else:
        time_extract = np.nan

    # Create list
    return data_list.append([img_extract, time_extract])

os.chdir(r"C:\\Users\\xxxxxx\\Desktop\\zip")

for folder in os.listdir():
    file_list = ZipFile("files101.zip", "r").namelist()
    with ProcessPool(processes=8) as pool:
        pool.map(image_processor, file_list)
My code just runs forever, the same as it does without multiprocessing. If multithreading would be the better approach, note that I have six cores. Any advice would be appreciated.
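
For reference, here is a minimal sketch of one way the job might be restructured. It is not a tested solution for the real data. It assumes that the metadata sits within the first 1024 bytes of each file, that giving each worker a whole zip folder is acceptable, and that `None` can stand in for `np.nan`. The names `extract_fields`, `process_archive`, `process_all`, and `HEADER_BYTES` are hypothetical. The sketch differs from the code above in three ways: each zip folder is opened once instead of once per image, workers return their rows instead of appending to a global list (which is not shared between processes), and the pool is started under an `if __name__ == "__main__":` guard, which `multiprocessing` requires on Windows.

```python
import glob
import os
import re
import zipfile
from multiprocessing import Pool

# Same patterns as in the question, compiled once and applied to raw bytes.
IMG_PATTERN = re.compile(rb"IMG:\d{,3}")
TIME_PATTERN = re.compile(rb"Time:\s\d{2}-\d{2}-\d{4}\s\s\d{2}:\d{2}:\d{2}")
HEADER_BYTES = 1024  # assumption: the metadata sits in the first KiB of each file


def extract_fields(header):
    """Return [img, time] found in the header bytes; None marks a missing field."""
    img = IMG_PATTERN.search(header)
    time = TIME_PATTERN.search(header)
    return [
        img.group().decode("ascii") if img else None,
        time.group().decode("ascii") if time else None,
    ]


def process_archive(zip_path):
    """Read the start of every member of one archive without extracting it."""
    rows = []
    with zipfile.ZipFile(zip_path) as archive:  # opened once per archive
        for name in archive.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            with archive.open(name) as member:
                rows.append(extract_fields(member.read(HEADER_BYTES)))
    return rows


def process_all(zip_dir, workers=6):
    """Hand each archive to a worker process and merge the returned rows."""
    zip_paths = sorted(glob.glob(os.path.join(zip_dir, "*.zip")))
    if not zip_paths:  # nothing to do, so do not start a pool
        return []
    with Pool(processes=workers) as pool:
        per_archive = pool.map(process_archive, zip_paths)
    return [row for rows in per_archive for row in rows]


if __name__ == "__main__":  # required on Windows, where workers re-import this module
    data_list = process_all(r"C:\Users\xxxxxx\Desktop\zip")
    print(len(data_list))
```

Splitting the work per zip folder keeps the number of tasks small (30 here) and avoids reopening the same archive for every image. Whether this is actually faster depends on disk speed as much as on the six cores, so it is worth timing on a single zip folder first.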