
This answer to "async 'read_csv' of several data frames in pandas - why isn't it faster" explains how to asynchronously read pandas DataFrames from CSV data obtained from a web request.

I modified it to read some CSV files from disk using aiofiles, but got no speedup. I wonder whether I did something wrong or whether there is some unavoidable limitation, such as pd.read_csv being blocking.

Here's the normal version of the code:

from time import perf_counter
import pandas as pd

def pandas_read_many(paths):
    # Baseline: let pandas open and parse each file directly.
    start = perf_counter()
    results = [pd.read_csv(p) for p in paths]
    end = perf_counter()
    print(f"Pandas version {end - start:0.2f}s")
    return results

The async version involves reading the file with aiofiles and converting it to a text buffer with io.StringIO before passing it to pd.read_csv.

import asyncio
import io

import aiofiles

async def async_read_csv(path):
    # Read the file contents without blocking the event loop.
    async with aiofiles.open(path) as f:
        text = await f.read()
        # pd.read_csv only accepts paths or file-like objects, hence the
        # StringIO wrapper; the parse itself is a synchronous call.
        with io.StringIO(text) as text_io:
            return pd.read_csv(text_io)
        
async def async_read_many(paths):
    start = perf_counter()
    results = await asyncio.gather(*(async_read_csv(p) for p in paths))
    end = perf_counter()
    print(f"Async version {end - start:0.2f}s")
    return results

For fairness, here is a synchronous translation that goes through the same read-then-StringIO round trip.

def sync_read_csv(path):
    with open(path) as f:
        text = f.read()
        with io.StringIO(text) as text_io:
            return pd.read_csv(text_io)
        
def sync_read_many(paths):
    start = perf_counter()
    results = [sync_read_csv(p) for p in paths]
    end = perf_counter()
    print(f"Sync version {end - start:0.2f}s")
    return results

Finally, the comparison, where I read 8 CSV files of approximately 125 MB each.

import asyncio

paths = [...]
asyncio.run(async_read_many(paths))
sync_read_many(paths)
pandas_read_many(paths)

# Async version 24.32s
# Sync version 24.87s
# Pandas version 18.37s
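
For reference, a variant that offloads each blocking `pd.read_csv` call to a worker thread with `asyncio.to_thread` (available since Python 3.9) might look like the sketch below, reusing the imports above. I have not benchmarked this variant; whether it helps depends on how much of the parse releases the GIL.

async def to_thread_read_csv(path):
    # Run the blocking parse in the default thread pool so the
    # event loop stays free while pandas works.
    return await asyncio.to_thread(pd.read_csv, path)

async def to_thread_read_many(paths):
    start = perf_counter()
    results = await asyncio.gather(*(to_thread_read_csv(p) for p in paths))
    end = perf_counter()
    print(f"to_thread version {end - start:0.2f}s")
    return results
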
edd313
  • The pandas version is faster because it is reading the files directly, not converting them into `io.StringIO` objects first. The real comparison is between the first two. – MattDMo Jun 02 '23 at 15:41
  • you don't need that `f.read() - io.StringIO` at all, it redundantly slows down the processing – RomanPerekhrest Jun 02 '23 at 15:42
  • @MattDMo I agree but I'm not sure if that's the whole story. For example, I could artificially make the code slower by making it sleep every time I load a csv file. If the sleep time is long enough, the async version becomes faster (I tested this). @RomanPerekhrest that step is necessary, because `pandas.read_csv` accepts only file paths or file-like objects as input. – edd313 Jun 02 '23 at 16:01
  • Please fix the inconsistent indentation of the `with` statements. – Barmar Jun 02 '23 at 16:11
  • Is the problem that you want faster csv parsing? The pyarrow engine is multithreaded and should be faster: `pd.read_csv(..., engine="pyarrow")`. There's also tools like polars and duckdb that have parallel csv readers and both export to pandas easily. – jqurious Jun 02 '23 at 21:10
  • Thanks @jqurious, I'm learning async and I'm trying to understand its potential. – edd313 Jun 03 '23 at 10:10
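
A minimal sketch of the pyarrow-engine suggestion from the comments above (it assumes pandas >= 1.4 with pyarrow installed; untested here):

def pyarrow_read_many(paths):
    start = perf_counter()
    # engine="pyarrow" delegates parsing to pyarrow's multithreaded CSV
    # reader, so even this sequential loop can use multiple cores per file.
    results = [pd.read_csv(p, engine="pyarrow") for p in paths]
    end = perf_counter()
    print(f"PyArrow version {end - start:0.2f}s")
    return results
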

0 Answers