Reading huge CSV files using Pandas vs. MySQL

Question

I have a 500+ MB CSV data file. My question is: which would be faster for data manipulation (e.g., reading, processing)? The Python MySQL client might be faster, since all the work is mapped into SQL queries and optimization is left to the optimizer. At the same time, Pandas deals with a local file, which should be faster than communicating with a server.

I have already checked "Large data" work flows using pandas, Best practices for importing large CSV files, Fastest way to write large CSV with Python, and Most efficient way to parse a large .csv in python?. However, I haven't really found any comparison regarding Pandas and MySQL.

Use Case:

I am working on a text dataset that consists of 1,737,123 rows and 8 columns. I am feeding this dataset into an RNN/LSTM network. Prior to feeding it, I do some preprocessing: encoding with a custom encoding algorithm.

More details

I have 250+ experiments to run and 12 architectures (different model designs) to try.

I am confused; I feel I am missing something.

I've found the fastest way of loading MySQL data is to do it through `LOAD DATA INFILE`. It's by far the most efficient route. — Blue, Oct 20 '18 at 19:46
@FrankerZ Could you please elaborate: do you mean it is the most efficient even compared with other Python techniques, or only the most efficient way of loading into MySQL? — ndrwnaguib, Oct 20 '18 at 19:48
Voting to close as unclear: impossible to answer without knowing your use scenario(s). — ivan_pozdeev, Oct 20 '18 at 19:53

ivan_pozdeev · Accepted Answer · 2018-10-20T20:30:45.120

There's no comparison online because these two scenarios give different results:

With Pandas, you end up with a DataFrame in memory (backed by NumPy ndarrays under the hood), accessible as native Python objects.
With the MySQL client, you end up with data in a MySQL database on disk (unless you're using an in-memory database), accessible via IPC/sockets.

So, the performance will depend on

how much data needs to be transferred over lower-speed channels (IPC, disk, network)
how fast transferring is compared with processing (which of them is the bottleneck)
which data format your processing facilities prefer (i.e., what additional conversions will be involved)
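
One way to see which of these factors dominates is to time loading and processing separately. The following is only a minimal sketch using the standard library: the small in-memory CSV and the `encode` function are hypothetical stand-ins for the real file and the custom encoding step, so the numbers it prints say nothing about the actual dataset.

```python
import csv
import io
import time

# Hypothetical stand-in for the real 500+ MB file: a small CSV built in memory.
SAMPLE = "id,text\n" + "".join(f"{i},sample text {i}\n" for i in range(10_000))


def encode(text):
    """Hypothetical placeholder for the custom encoding step."""
    return [ord(ch) for ch in text]


def time_stages(source):
    """Return (row_count, load_seconds, process_seconds) for a CSV text source."""
    start = time.perf_counter()
    rows = list(csv.DictReader(source))  # transfer: parse into Python objects
    loaded = time.perf_counter()
    encoded = [encode(row["text"]) for row in rows]  # processing
    done = time.perf_counter()
    return len(encoded), loaded - start, done - loaded


n_rows, load_s, process_s = time_stages(io.StringIO(SAMPLE))
print(f"{n_rows} rows: load {load_s:.4f}s, process {process_s:.4f}s")
```

If the processing time is much larger than the load time, the choice between Pandas and MySQL matters less than the cost of the encoding step itself.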

E.g.:

If your processing facility can reside in the same (Python) process that reads the data, reading it directly into Python types is preferable, since you won't need to transfer it all to the MySQL process and then back again (converting formats each time).
On the other hand, if your processing facility is implemented in some other process and/or language, or, e.g., resides within a computing cluster, hooking it up to MySQL directly may be faster: it takes the comparatively slow Python out of the equation, and you would need to transfer the data again and convert it into the processing app's native objects anyway.
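
For the first case, a rough sketch of in-process reading with Pandas might look like this. It assumes the preprocessing can run in the same Python process; the sample data, the `text` column name, the chunk size, and the `encode` function are all hypothetical placeholders rather than anything taken from the question.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file.
SAMPLE = "id,text\n" + "".join(f"{i},row {i}\n" for i in range(1000))


def encode(text):
    """Hypothetical placeholder for the custom encoding algorithm."""
    return [ord(ch) for ch in text]


def load_encoded(source, chunksize=250):
    """Read the CSV in chunks and encode the text column of each chunk."""
    chunks = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        chunk["encoded"] = chunk["text"].map(encode)
        chunks.append(chunk)
    return pd.concat(chunks, ignore_index=True)


df = load_encoded(io.StringIO(SAMPLE))
print(df.shape)
```

With a real file, a path can be passed in place of the `io.StringIO` object. Reading in chunks keeps peak memory lower while parsing, although the concatenated result still has to fit in memory.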
