
This answer to "async 'read_csv' of several data frames in pandas - why isn't it faster" explains how to asynchronously read pandas DataFrames from CSV data obtained from a web request.

I modified it to read some CSV files from disk using aiofiles, but got no speedup. I wonder whether I did something wrong or whether there is some unavoidable limitation, such as pd.read_csv being blocking.

Here's the normal version of the code:

from time import perf_counter
import pandas as pd

def pandas_read_many(paths):
    # Baseline: let pandas open and parse each file directly.
    start = perf_counter()
    results = [pd.read_csv(p) for p in paths]
    end = perf_counter()
    print(f"Pandas version {end - start:0.2f}s")
    return results

The async version involves reading the file with aiofiles and converting it to a text buffer with io.StringIO before passing it to pd.read_csv.

import asyncio
import io

import aiofiles

async def async_read_csv(path):
    # Read the file contents without blocking the event loop.
    async with aiofiles.open(path) as f:
        text = await f.read()
        # pd.read_csv only accepts paths or file-like objects, hence the
        # StringIO wrapper; the parse itself is a synchronous call.
        with io.StringIO(text) as text_io:
            return pd.read_csv(text_io)
        
async def async_read_many(paths):
    start = perf_counter()
    results = await asyncio.gather(*(async_read_csv(p) for p in paths))
    end = perf_counter()
    print(f"Async version {end - start:0.2f}s")
    return results

For fairness, here is a synchronous translation that goes through the same read-then-StringIO round trip.

def sync_read_csv(path):
    with open(path) as f:
        text = f.read()
        with io.StringIO(text) as text_io:
            return pd.read_csv(text_io)
        
def sync_read_many(paths):
    start = perf_counter()
    results = [sync_read_csv(p) for p in paths]
    end = perf_counter()
    print(f"Sync version {end - start:0.2f}s")
    return results

Finally, the comparison, where I read 8 CSV files of approximately 125 MB each.

import asyncio

paths = [...]
asyncio.run(async_read_many(paths))
sync_read_many(paths)
pandas_read_many(paths)

# Async version 24.32s
# Sync version 24.87s
# Pandas version 18.37s
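
For reference, a variant that offloads each blocking `pd.read_csv` call to a worker thread with `asyncio.to_thread` (available since Python 3.9) might look like the sketch below, reusing the imports above. I have not benchmarked this variant; whether it helps depends on how much of the parse releases the GIL.

async def to_thread_read_csv(path):
    # Run the blocking parse in the default thread pool so the
    # event loop stays free while pandas works.
    return await asyncio.to_thread(pd.read_csv, path)

async def to_thread_read_many(paths):
    start = perf_counter()
    results = await asyncio.gather(*(to_thread_read_csv(p) for p in paths))
    end = perf_counter()
    print(f"to_thread version {end - start:0.2f}s")
    return results
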
edd313
  • The pandas version is faster because it is reading the files directly, not converting them into `io.StringIO` objects first. The real comparison is between the first two. – MattDMo Jun 02 '23 at 15:41
  • you don't need that `f.read() - io.StringIO` at all, it redundantly slows down the processing – RomanPerekhrest Jun 02 '23 at 15:42
  • @MattDMo I agree but I'm not sure if that's the whole story. For example, I could artificially make the code slower by making it sleep every time I load a csv file. If the sleep time is long enough, the async version becomes faster (I tested this). @RomanPerekhrest that step is necessary, because `pandas.read_csv` accepts only file paths or file-like objects as input. – edd313 Jun 02 '23 at 16:01
  • Please fix the inconsistent indentation of the `with` statements. – Barmar Jun 02 '23 at 16:11
  • Is the problem that you want faster csv parsing? The pyarrow engine is multithreaded and should be faster: `pd.read_csv(..., engine="pyarrow")`. There's also tools like polars and duckdb that have parallel csv readers and both export to pandas easily. – jqurious Jun 02 '23 at 21:10
  • Thanks @jqurious, I'm learning async and I'm trying to understand its potential. – edd313 Jun 03 '23 at 10:10
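
A minimal sketch of the pyarrow-engine suggestion from the comments above (it assumes pandas >= 1.4 with pyarrow installed; untested here):

def pyarrow_read_many(paths):
    start = perf_counter()
    # engine="pyarrow" delegates parsing to pyarrow's multithreaded CSV
    # reader, so even this sequential loop can use multiple cores per file.
    results = [pd.read_csv(p, engine="pyarrow") for p in paths]
    end = perf_counter()
    print(f"PyArrow version {end - start:0.2f}s")
    return results
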

0 Answers