I want to write code that reads several pandas DataFrames asynchronously, for example from CSV files (or from a database).
I wrote the following code, expecting it to import the two data frames faster; however, it actually seems to run slower:
import timeit
import pandas as pd
import asyncio
train_to_save = pd.DataFrame(data={'feature1': [1, 2, 3], 'period': [1, 1, 1]})
test_to_save = pd.DataFrame(data={'feature1': [1, 4, 12], 'period': [2, 2, 2]})
train_to_save.to_csv('train.csv')
test_to_save.to_csv('test.csv')
async def run_async_train():
    return pd.read_csv('train.csv')

async def run_async_test():
    return pd.read_csv('test.csv')

async def run_train_test_async():
    # schedule both coroutines and wait for the results
    df = await asyncio.gather(run_async_train(), run_async_test())
    return df
start_async = timeit.default_timer()
async_train, async_test = asyncio.run(run_train_test_async())
finish_async = timeit.default_timer()
time_to_run_async = finish_async - start_async
start = timeit.default_timer()
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
finish = timeit.default_timer()
time_to_run_without_async = finish - start

print(time_to_run_async < time_to_run_without_async)
Why does the non-async version read the two data frames faster?
Just to make it clear, in production I'm actually going to read the data from BigQuery,
so I'm really interested in speeding up both requests (train & test) using the code above.
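For reference, this is the kind of alternative I've been considering: pushing each blocking read into a worker thread with asyncio.to_thread so that asyncio.gather can actually overlap the work. It's only a sketch under my own assumptions (the load_frames_concurrently name is mine, and asyncio.to_thread needs Python 3.9+); I haven't confirmed it's the right approach:

import asyncio
import pandas as pd

async def load_frames_concurrently(paths):
    # pd.read_csv blocks the event loop, so each call is handed to a worker thread;
    # gather then waits for all of the threads to finish
    return await asyncio.gather(
        *(asyncio.to_thread(pd.read_csv, path) for path in paths)
    )

train, test = asyncio.run(load_frames_concurrently(['train.csv', 'test.csv']))

I assume the same pattern would apply to a blocking BigQuery read (e.g. wrapping something like pandas_gbq.read_gbq in asyncio.to_thread), but I haven't tested that part.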
Thanks in advance!