
I've read about multiprocessing, threading and asyncio but I'm not quite sure which one I should use for my case.

I'm trying to parse TSV files and write each one into its own SQLite file (writing only once into each file). The following code works just fine, and my question is more about understanding why I should use one approach and not the others for this case.

import multiprocessing
import sqlite3
from pathlib import Path


def read_large_file(f):
    for l in f:
        yield l.strip().split()


def worker(infile):
    with sqlite3.connect(f"{infile}.sqlite") as connection, open(infile) as f:
        connection.execute("""
            CREATE TABLE IF NOT EXISTS test (
                field1 TEXT PRIMARY KEY,
                field2 TEXT,
                field3 TEXT
            )""")

        next(f)  # skip the header line
        connection.executemany("INSERT INTO test VALUES (?, ?, ?)", read_large_file(f))
    connection.close()


if __name__ == "__main__":
    infiles = list(Path("../data/").glob("test_*_.csv"))
    pool = multiprocessing.Pool(processes=len(infiles))
    pool.map(worker, infiles)
    pool.close()
    pool.join()

I've read the Real Python article about those libraries, this article about concurrency in Python, and a few SO posts like this one, but I'm not sure I understood them well.

From what I understood, both asyncio and threading are limited by the Global Interpreter Lock, meaning only one thread executes Python bytecode at a time, so they do not run in parallel, while multiprocessing is not limited by it.
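
That said, threads can still overlap I/O waits despite the GIL, because blocking calls release it. A minimal sketch (not from my code; `time.sleep` stands in for a blocking read):

    import time
    from concurrent.futures import ThreadPoolExecutor

    def io_task(_):
        # time.sleep releases the GIL, like a blocking file or network read
        time.sleep(0.2)

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as ex:
        list(ex.map(io_task, range(4)))
    elapsed = time.perf_counter() - start
    # the four waits overlap, so the total is close to one sleep, not four
    print(f"{elapsed:.2f}s")
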

I first chose multiprocessing because I could run each process at the same time, and since they are completely independent from each other it was pretty easy to do.

However, as you can see, I'm mostly working with files and doing barely any CPU-bound work, meaning there is a lot of waiting on I/O. From my reading, this suggests I should use asyncio instead, even though it isn't parallel.

As said above, the code is working fine, but since I'm I/O bound, should I use asyncio, or should I stick with multiprocessing because each process is completely independent?
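
For reference, here is a sketch of the same fan-out using `concurrent.futures.ProcessPoolExecutor`, with the pool capped at the CPU count instead of one process per file (the glob pattern matches my code above; the worker body is a placeholder):

    import os
    from concurrent.futures import ProcessPoolExecutor
    from pathlib import Path

    def worker(infile):
        # placeholder: same body as the worker above
        # (open the TSV, create the table, insert the rows)
        pass

    if __name__ == "__main__":
        infiles = list(Path("../data/").glob("test_*_.csv"))
        # cap the pool so a large batch of files can't spawn one process each
        n_workers = min(len(infiles), os.cpu_count() or 1) or 1
        with ProcessPoolExecutor(max_workers=n_workers) as executor:
            list(executor.map(worker, infiles))
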

  • Look at [Sqlite's behavior](https://www.sqlite.org/faq.html#q5): "only one process can be making changes to the database at any moment in time". In your code, you appear to be creating an arbitrary number of Sqlite files (it would help to clarify this point since it is quite relevant) and writing to each just once. In that use case, `multiprocessing` (or `concurrent.futures`) might be suitable. As an aside, setting the number of processes in the pool based on the count of input data can be an entertaining way to crash your machine. – ralex Apr 25 '19 at 05:11
  • I'm indeed using different SQLite files and writing only once. I'll take a look at `concurrent.futures` out of curiosity though. Thanks for the heads up about the number of processes declaration. – Plopp Apr 25 '19 at 07:53
