
I'm trying to calculate hashes for files to check whether any changes have been made. I have a GUI and some other observers running in the event loop, so I decided to calculate the hash of the files (MD5/SHA-1, whichever is faster) asynchronously.

Synchronous code:

import hashlib
import time


chunk_size = 4 * 1024

def getHash(filename):
    md5_hash = hashlib.md5()
    with open(filename, "rb") as f:
        for byte_block in iter(lambda: f.read(chunk_size), b""):
            md5_hash.update(byte_block)
        print("getHash : " + md5_hash.hexdigest())

start = time.time()
getHash("C:\\Users\\xxx\\video1.mkv")
getHash("C:\\Users\\xxx\\video2.mkv")
getHash("C:\\Users\\xxx\\video3.mkv")
end = time.time()

print(end - start)

Output of synchronous code: 2.4000535011291504

Asynchronous code:

import hashlib
import aiofiles
import asyncio
import time


chunk_size = 4 * 1024

async def get_hash_async(file_path: str):
    async with aiofiles.open(file_path, "rb") as fd:
        md5_hash = hashlib.md5()
        while True:
            chunk = await fd.read(chunk_size)
            if not chunk:
                break
            md5_hash.update(chunk)
        print("get_hash_async : " + md5_hash.hexdigest())

async def check():
    start = time.time()
    t1 = get_hash_async("C:\\Users\\xxx\\video1.mkv")
    t2 = get_hash_async("C:\\Users\\xxx\\video2.mkv")
    t3 = get_hash_async("C:\\Users\\xxx\\video3.mkv")
    await asyncio.gather(t1, t2, t3)
    end = time.time()
    print(end - start)

loop = asyncio.get_event_loop()
loop.run_until_complete(check())

Output of asynchronous code: 27.957366943359375

Am I doing it right? Or are there any changes I should make to improve the performance of the code?

Thanks in advance.

Manthri Anvesh
  • Depending on the physical device they are on, reading large files in parallel can be much slower than reading them one after the other due to seek times. – Klaus D. Jun 20 '19 at 12:49
  • @KlausD. Yes, I have played with the code, changing the chunk size, and found that the larger the chunk size, the faster the async code runs, whereas it makes no difference to the synchronous code. – Manthri Anvesh Jun 20 '19 at 13:00
  • Better to use threads. Right now you are using a thread pool under the hood, just hidden behind `async`/`await`. Use a `concurrent.futures.ThreadPoolExecutor` directly (see the sketch after this thread). – BlackJack Jun 20 '19 at 15:58
  • You might get more meaningful times with `time.process_time()`. – President James K. Polk Jun 20 '19 at 17:50
  • @BlackJack Yes, using a thread doesn't block the event loop, but I have heard from a few sources that avoiding threads is considered best practice in asynchronous programming, since the main purpose of asynchronous programming is to avoid creating threads for processing. – Manthri Anvesh Jun 21 '19 at 06:09
  • @ManthriAnvesh `aiofiles` uses threads and hides that behind `async`/`await` — apparently with quite some overhead. – BlackJack Jun 21 '19 at 13:23
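
Following BlackJack's suggestion, here is a minimal sketch that uses a concurrent.futures.ThreadPoolExecutor directly instead of hiding the thread pool behind `async`/`await`; the file names are placeholders:

import hashlib
from concurrent.futures import ThreadPoolExecutor

chunk_size = 1024 * 1024

def get_hash(filename):
    # Plain synchronous hashing; the executor supplies the concurrency.
    md5_hash = hashlib.md5()
    with open(filename, "rb") as f:
        for byte_block in iter(lambda: f.read(chunk_size), b""):
            md5_hash.update(byte_block)
    return md5_hash.hexdigest()

files = ["video1.mkv", "video2.mkv", "video3.mkv"]  # placeholder paths

with ThreadPoolExecutor() as executor:
    for digest in executor.map(get_hash, files):
        print(digest)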

1 Answer


In the synchronous case, you read the files one after another, and each file is read in chunks sequentially, which is the access pattern disks handle fastest.

In the asynchronous case, your event loop blocks while it is calculating a hash, which is why only one hash can be calculated at a time. See: What do the terms "CPU bound" and "I/O bound" mean?

If you want to increase the hashing speed, you need to use threads. CPython's hashlib releases the GIL while hashing large buffers, so the threads can run on the CPU in parallel. Increasing CHUNK_SIZE should also help.

import hashlib
import os
import time

from pathlib import Path
from multiprocessing.pool import ThreadPool


CHUNK_SIZE = 1024 * 1024  # 1 MiB; large chunks amortise per-read overhead


def get_hash(filename):
    md5_hash = hashlib.md5()
    with open(filename, "rb") as f:
        # Feed the file to MD5 in large chunks.
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            md5_hash.update(chunk)
    return md5_hash


if __name__ == '__main__':
    directory = Path("your_dir")
    files = [path for path in directory.iterdir() if path.is_file()]

    number_of_workers = os.cpu_count()  # one worker thread per CPU core
    start = time.time()
    with ThreadPool(number_of_workers) as pool:
        files_hash = pool.map(get_hash, files)
    end = time.time()

    print(end - start)

As for hashing only one file: aiofiles hands every read off to a worker thread behind the scenes, and that thread hand-off has to be paid for on each call, so a small chunk size makes the overhead dominate.
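
If the goal is just to keep the GUI's event loop responsive while a single file is hashed, one option is to run the synchronous get_hash shown above in the loop's default executor. A minimal sketch, assuming Python 3.7+ and a placeholder file path:

import asyncio

async def hash_in_background(path):
    # Off-load the blocking read + hash to a worker thread;
    # the event loop stays free for the GUI and other observers.
    # get_hash is the synchronous function defined above.
    loop = asyncio.get_running_loop()
    md5_hash = await loop.run_in_executor(None, get_hash, path)
    print(md5_hash.hexdigest())

asyncio.run(hash_in_background("video1.mkv"))  # placeholder path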

rabbit72
  • My question is not about calculating hashes of multiple files in parallel. In my use case I will mostly have a single file at any instant of time, so the above approach of creating threads to calculate in parallel doesn't help me much. – Manthri Anvesh Jan 26 '20 at 15:58
  • If you use an event loop and want to do a CPU-bound task, the best way is to create a thread for calculating the hash. – rabbit72 Jan 27 '20 at 23:27
  • In a web API, the common practice for CPU-intensive tasks is to use separate workers (like Celery). In your case, you can create a single thread and calculate the hash with a synchronous library. – rabbit72 Jan 27 '20 at 23:35