This answer to async 'read_csv' of several data frames in pandas - why isn't it faster explains how to asynchronously read pandas DataFrames from CSV data obtained from a web request.
I modified it to read some CSV files from disk using aiofiles, but got no speedup.
I wonder whether I did something wrong, or whether there is some unavoidable limitation, such as pd.read_csv being a blocking call.
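To see what "blocking" would mean in practice, here is a quick check I sketched (not part of the benchmark below): if pd.read_csv holds the event loop, a concurrently scheduled heartbeat task should stop ticking for the duration of the parse. The names heartbeat and check_blocking, and the file path, are mine, just for illustration.

import asyncio
from time import perf_counter

import pandas as pd

async def heartbeat():
    # Ticks about every 0.1 s, but only while the event loop is free.
    while True:
        print(f"tick {perf_counter():0.2f}")
        await asyncio.sleep(0.1)

async def check_blocking(path):
    task = asyncio.create_task(heartbeat())
    await asyncio.sleep(0)  # give the heartbeat a chance to start
    df = pd.read_csv(path)  # if this blocks the loop, the ticks stop here
    task.cancel()
    return df

# asyncio.run(check_blocking("big.csv"))  # "big.csv" is a placeholder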
Here's the baseline pandas version of the code:
from time import perf_counter

import pandas as pd

def pandas_read_many(paths):
    start = perf_counter()
    results = [pd.read_csv(p) for p in paths]
    end = perf_counter()
    print(f"Pandas version {end - start:0.2f}s")
    return results
The async version reads the file with aiofiles and wraps the contents in an io.StringIO text buffer before passing it to pd.read_csv.
import asyncio
import io

import aiofiles

async def async_read_csv(path):
    async with aiofiles.open(path) as f:
        text = await f.read()
    with io.StringIO(text) as text_io:
        return pd.read_csv(text_io)

async def async_read_many(paths):
    start = perf_counter()
    results = await asyncio.gather(*(async_read_csv(p) for p in paths))
    end = perf_counter()
    print(f"Async version {end - start:0.2f}s")
    return results
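For reference, a variant I considered but did not benchmark would hand the blocking pd.read_csv call off to a thread pool with asyncio.to_thread (Python 3.9+; threaded_read_csv and threaded_read_many are hypothetical names of mine, reusing the imports above). I don't know how much of pd.read_csv releases the GIL, so I am not assuming this is faster.

async def threaded_read_csv(path):
    # Runs the blocking parse in the default ThreadPoolExecutor,
    # so the event loop stays free while the file is parsed.
    return await asyncio.to_thread(pd.read_csv, path)

async def threaded_read_many(paths):
    start = perf_counter()
    results = await asyncio.gather(*(threaded_read_csv(p) for p in paths))
    end = perf_counter()
    print(f"Threaded version {end - start:0.2f}s")
    return results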
For fairness, here is the synchronous translation of the aiofiles version.
def sync_read_csv(path):
    with open(path) as f:
        text = f.read()
    with io.StringIO(text) as text_io:
        return pd.read_csv(text_io)

def sync_read_many(paths):
    start = perf_counter()
    results = [sync_read_csv(p) for p in paths]
    end = perf_counter()
    print(f"Sync version {end - start:0.2f}s")
    return results
Finally, the comparison, where I read 8 CSV files of approximately 125 MB each.
paths = [...]

asyncio.run(async_read_many(paths))
sync_read_many(paths)
pandas_read_many(paths)
# Async version 24.32s
# Sync version 24.87s
# Pandas version 18.37s
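Since neither variant beats the plain pandas loop, my working assumption is that the parsing itself is the bottleneck, and if it is CPU-bound work that holds the GIL, I would expect only separate processes to parallelise it. A sketch of what I would try next (process_read_many is a hypothetical name of mine, reusing the imports above; I have not measured this):

from concurrent.futures import ProcessPoolExecutor

def process_read_many(paths):
    start = perf_counter()
    # One worker process per file sidesteps the GIL entirely, at the cost
    # of pickling each resulting DataFrame back to the parent process.
    # NOTE: with the spawn start method (Windows/macOS) this must be
    # called under an `if __name__ == "__main__":` guard.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(pd.read_csv, paths))
    end = perf_counter()
    print(f"Process version {end - start:0.2f}s")
    return results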