
I have the code below, which walks a directory, opens each file in binary mode, and converts the contents to hex.

It runs, but I want to make it faster: it currently takes about 4 minutes to read 100k files, and it doesn't spread the work across multiple processors. Any ideas?

import binascii
import os

def binary_file_reader(file_data):
    with open(file_data, 'rb') as binary_file_data:
        binary_file_data = binary_file_data.read()
        binary_data = binascii.hexlify(binary_file_data)
        binary_data = binary_data.decode("utf-8")
    return binary_data

if __name__ == "__main__":
    success_files_counted = 0
    unsuccess_files_counted = 0
    read_file_names = []
    device_directory = os.getcwd()

    for r, d, f in os.walk(device_directory):
        for file in f:
            try:
                file_data = os.path.join(r, file)
                binary_data = binary_file_reader(file_data)
                read_file_names.append("Successful: "+r+file)
                success_files_counted+=1       
            except IOError:
                read_file_names.append("Unsuccessful: "+r+file)
                unsuccess_files_counted+=1
  • Could you please fix the indentation of your code? It will raise a syntax error as is. – jsbueno Mar 01 '21 at 15:06
  • (hint: don't try to indent the lines manually after pasting here: either use the `{}` format button or use three backticks (```) to delimit a code block.) – jsbueno Mar 01 '21 at 15:09
  • Yep sure, that was odd. Now, any ideas on the question? Indenting is the least of my issues lol – RajB_007 Mar 01 '21 at 16:42
  • Do you know whether a significant amount of the time (if not most) is getting the list of files? Tricky to measure since [os.walk() is much faster after the first run due to page caching](https://stackoverflow.com/questions/28339263/is-os-walk-much-faster-after-the-first-run-due-to-page-caching). You can test by timing twice--the 2nd run of os.walk will use cached pages. If the 2nd run is significantly less than 4 minutes then we know the time is due to os.walk obtaining the directory structure. – DarrylG Mar 01 '21 at 17:49
  • Definitely have tried that. If I run it without the binary-reading function, it can walk and append in 20 seconds. When it needs to open each file as binary and go through the function, it takes significantly longer. – RajB_007 Mar 01 '21 at 17:58
  • 1
    @RajB_007--to check if it's IO limited or CPU limited can you comment out the lines in binary_file_reader correspoinding to ` binary_data = binascii.hexlify(binary_file_data) binary_data = binary_data.decode("utf-8")` and just return binary_file_data. I tried a multithreaded version on my machine so want to check if this would help with this test (didn't help on my machine since os.walk is slow the first time). – DarrylG Mar 01 '21 at 18:28
  • @DarrylG just tried it; my results: no binary function - 15s to os.walk; with the lines you suggested removed - 1m 6s to walk and open the binary files; whole code as posted - 3m 20s to walk, open as binary, and convert to hex. – RajB_007 Mar 01 '21 at 18:56
  • 1
    @RajB_007--added an answer to try. In main.py you can simulate data or comment this section out to use your real data. – DarrylG Mar 01 '21 at 21:09

1 Answer


Python's concurrent.futures module provides two types of parallel processing:

  • Multi-threading (for I/O-bound tasks)
  • Multi-processing (for CPU-bound tasks)
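
The difference in usage is small: you write one worker function and hand it to either a ThreadPoolExecutor or a ProcessPoolExecutor. A minimal sketch of that idea (the helper names here are just for illustration; process_file is the worker defined in the full code below):

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def hexify_with_threads(paths):
    ' Threads share one interpreter; best when the work is mostly waiting on I/O '
    with ThreadPoolExecutor() as executor:
        return list(executor.map(process_file, paths))

def hexify_with_processes(paths):
    ' Separate processes side-step the GIL; best when the work is CPU-heavy, e.g. hexlify '
    with ProcessPoolExecutor() as executor:
        return list(executor.map(process_file, paths, chunksize=1000))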

Results of evaluating both for speedup on your task, using 10K files:

  • The non-parallel and multi-threaded versions take about the same time
  • The multi-processing version is about 2X faster

Code

Note: The multiprocessing code is placed in a separate file because of issues with Jupyter notebooks on Windows (the worker function has to be importable by the spawned worker processes). This is not necessary in other environments.

File: multi_process_hexify.py (all the processing code)

import os
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import ProcessPoolExecutor
from time import time
import binascii

def all_files(directory):
    ' Generator for list of files starting with directory '
    for r, d, f in os.walk(directory):
        for name in f:
            yield os.path.join(r, name)

def create_test_files(folder_path, number_files, size):
    ' Create files with random binary data '
    # Create files folder (if doesn't exist)
    Path(folder_path).mkdir(parents=True, exist_ok=True) 

    # Create data in folder
    for i in range(number_files):
        data = os.urandom(size)
        with open(os.path.join(folder_path, f'{i}.txt'), 'wb') as f:
            f.write(data)
       
def binary_file_reader(file_path):
    ' Read a file in binary mode and return its contents as a hex string '
    with open(file_path, "rb") as binary_file:
        binary_file_data = binary_file.read()
        binary_data = binascii.hexlify(binary_file_data)
        binary_data = binary_data.decode("utf-8")
    return binary_data

def process_file(file_path):
    ' Read one file and report whether it could be processed '
    try:
        binary_data = binary_file_reader(file_path)
        return f"Successful: {file_path}"
    except IOError:
        return f"Unsuccessful: {file_path}"
  
def get_final(responses):
    ' Materialize the responses and count successful/unsuccessful files '
    responses = list(responses)
    successful = sum(1 for x in responses if x[0] == 'S')  # Count successful
    unsuccessful = len(responses) - successful             # Count unsuccessful
    return responses, successful, unsuccessful

def main_non_parallel(device_directory):
    ' Unthreaded processing using process_file '
    start = time()
    responses = [process_file(file_path) for file_path in all_files(device_directory)]

    result = get_final(responses)
    end = time() - start
    
    print(f"Processed main_unthreaded in {end:.4f} sec")
    return result

def main_multithreaded(device_directory):
    ' Multithreaded processing using process_file '
    # https://stackoverflow.com/questions/42074501/python-concurrent-futures-processpoolexecutor-performance-of-submit-vs-map/42096963#42096963
    start = time()
    with ThreadPoolExecutor() as executor:
        futures = executor.map(process_file, all_files(device_directory), chunksize=1000)

    result = get_final(futures)
    end = time() - start

    print(f"Processed main_multithreaded in {end:.4f} sec")
    return result

def main_multiprocessing(device_directory):
    ' Multiprocessing using process_file '
    start = time()
    files = list(all_files(device_directory))
    with ProcessPoolExecutor() as executor:
        futures = executor.map(process_file, files, chunksize=1000)

    result = get_final(futures)
    end = time() - start

    print(f"Processed main_multiprocessing in {end:.4f} sec")
    return result

Test

File: main.py

import os
import multi_process_hexify

if __name__ == '__main__':
    # Directory for files
    device_directory = os.path.join(os.getcwd(), 'test_dir')

    # Create simulated test data (comment this section out to use your real data)
    multi_process_hexify.create_test_files(device_directory, 100, 100)

    # Perform non-parallel processing
    read_file_names_unthreaded, successful, unsuccessful = multi_process_hexify.main_non_parallel(device_directory)
    print(f'Successful {successful}, Unsuccessful {unsuccessful}')
    print()

    # Perform multi-threaded processing
    read_file_names_threaded, successful, unsuccessful = multi_process_hexify.main_multithreaded(device_directory)
    print(f'Successful {successful}, Unsuccessful {unsuccessful}')
    print()

    # Perform multi-process processing
    read_file_names_multiprocessing, successful, unsuccessful = multi_process_hexify.main_multiprocessing(device_directory)
    print(f'Successful {successful}, Unsuccessful {unsuccessful}')

    # Confirm all three methods produce the same result
    print(read_file_names_unthreaded == read_file_names_threaded == read_file_names_multiprocessing)

Output

Processed main_unthreaded in 2.6610 sec
Successful 10000, Unsuccessful 0

Processed main_multithreaded in 3.2250 sec
Successful 10000, Unsuccessful 0

Processed main_multiprocessing in 1.2241 sec
Successful 10000, Unsuccessful 0
True
  • @DarrylG oh wow, that is amazing. I'm in awe, thank you ever so much, it works so well. I'll be reading around this for sure! – RajB_007 Mar 01 '21 at 21:34
  • 1
    @RajB_007-glad I could help. Did you have a chance to use your actual data? If so, how was your speed up (if any)? – DarrylG Mar 01 '21 at 21:36
  • @DarrylG still combining what I had with yours. I will comment in 2hrs tops once I finish work, to test it. Again, thank you! – RajB_007 Mar 01 '21 at 21:39
  • @DarrylG I said it once, I'll say it again: thank you, legend. Results below, and wow it's fast! Starting folder: /Users; processed main_unthreaded in 83.5037 sec, Successful 94241, Unsuccessful 5109; processed main_multithreaded in 51.2093 sec, Successful 94259, Unsuccessful 5109; processed main_multiprocessing in 19.9108 sec, Successful 94259, Unsuccessful 5109. – RajB_007 Mar 01 '21 at 22:43
  • 1
    @RajB_007--great that it worked out. To be honest, frequently it doesn't work out (i.e. multiprocessing is actually slower) due to the overhead (i.e. starting processes, transferring data between the main process and other processes, etc.). – DarrylG Mar 01 '21 at 23:14
  • @DarrylG ahh ok, I'll keep that in mind, still a legend dude! – RajB_007 Mar 02 '21 at 00:22