I've read about `multiprocessing`, `threading` and `asyncio`, but I'm not quite sure which one I should use for my case.

I'm trying to parse TSV files and put them into different SQLite files (writing only once into each file). The following code works just fine, and my question is more about understanding why I should use one and not the other for this case.
import multiprocessing
import sqlite3
from pathlib import Path

def read_large_file(f):
    for l in f:
        yield l.strip().split()

def worker(infile):
    with sqlite3.connect(str(infile) + ".sqlite") as connection, open(infile) as f:
        connection.execute("""
            CREATE TABLE IF NOT EXISTS test (
                field1 TEXT PRIMARY KEY,
                field2 TEXT,
                field3 TEXT
            )""")
        next(f)  # skip the header line
        connection.executemany("INSERT INTO test VALUES (?, ?, ?)", read_large_file(f))
    connection.close()

if __name__ == "__main__":
    infiles = list(Path("../data/").glob("test_*_.csv"))
    pool = multiprocessing.Pool(processes=len(infiles))
    pool.map(worker, infiles)
    pool.close()
    pool.join()
I've read the Real Python article about those libraries, this article about concurrency in Python, and a few SO posts like this one, but I'm not sure I understood everything well.
From what I understood, both `asyncio` and `threading` are limited by the Global Interpreter Lock, meaning they do not run Python code in parallel, while `multiprocessing` is not limited by it.
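One nuance worth noting: the GIL is released while a thread is blocked on IO, so threads can still overlap waiting time even though only one of them runs Python bytecode at any instant. A minimal sketch of that effect, using `time.sleep` as a stand-in for a blocking file or database call (the task and timings here are illustrative, not from my actual code):

```python
# The GIL is released during blocking IO, so threads can overlap waits
# even though only one thread executes Python bytecode at a time.
import time
from concurrent.futures import ThreadPoolExecutor

def io_task(seconds):
    time.sleep(seconds)  # stands in for a blocking read/write
    return seconds

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(io_task, [0.2] * 4))
elapsed = time.perf_counter() - start

# The four 0.2 s waits overlap, so the total is ~0.2 s rather than ~0.8 s.
print(results, round(elapsed, 1))
```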
I first chose `multiprocessing` because I could run each process at the same time, and since the files are completely independent from each other it was pretty easy to do.
However, as you can see, I'm working with files and barely do any CPU-bound work, meaning there is a lot of waiting on IO. From my readings this suggests I should instead use `asyncio`, even though it's not parallel.
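One thing that gives me pause about `asyncio`: neither `sqlite3` nor plain `open()` is awaitable, so as far as I understand, an async version of my code would still have to push the blocking work into an executor. A rough sketch of what I imagine it would look like (`worker` here is a dummy stand-in for my function above, just to show the shape):

```python
import asyncio

def worker(infile):
    # stand-in for the blocking parse-and-insert function from my code
    return infile + ".sqlite"

async def main(infiles):
    loop = asyncio.get_running_loop()
    # sqlite3 and open() are blocking, so asyncio alone gains nothing here:
    # the calls still run in threads via the default executor.
    tasks = [loop.run_in_executor(None, worker, f) for f in infiles]
    return await asyncio.gather(*tasks)

results = asyncio.run(main(["a.tsv", "b.tsv"]))
print(results)
```

So the async version would effectively be a thread pool with extra ceremony, which makes me doubt it is the right tool here.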
As said above, the code is working fine, but since I'm IO-bound, should I use `asyncio`, or should I stick with `multiprocessing` because each process is completely independent?