I use pandas to work with data. I like this approach because data manipulation is very easy in pandas (selecting rows, adding rows, removing columns, grouping, joining tables, and so on).
My question is whether pandas is also a good choice when the data are huge. In particular, I worry about modifying and extracting data. Before I can modify the data or extract something from them, I need to read (load) the data from a file, and then, after I have done what I wanted (selecting or modifying), I need to save the data back to the file. I am afraid that this "loading" and "saving" of data may be very slow for huge data. By huge data I mean several hundred million rows.
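To make this concrete, here is a simplified sketch of the workflow I have in mind (the file name and column names are just placeholders):

```python
import pandas as pd

# Load the whole table from disk into memory.
df = pd.read_csv("data.csv")

# Select or modify something, e.g. pick rows matching a condition
# and add a derived column.
subset = df[df["user_id"] == 12345]
df["total"] = df["price"] * df["quantity"]

# Save everything back to disk, even if only a few rows changed.
df.to_csv("data.csv", index=False)
```

My concern is that the read and write steps always touch the entire file, no matter how small the actual change is.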
In particular, my question is whether pandas can be used as a replacement for databases (for example SQLite or MySQL). Alternatively, would it be faster to use a Python interface for MySQL to find a particular row in a huge table (stored in a MySQL database) than to find the same row in a corresponding data frame that is saved as a file?
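The comparison I have in mind looks roughly like this (using SQLite here only to keep the sketch self-contained; the table and column names are made up):

```python
import sqlite3
import pandas as pd

# Database approach: the engine locates the row, and only that row is read.
conn = sqlite3.connect("data.db")
row_from_db = pd.read_sql_query(
    "SELECT * FROM events WHERE user_id = ?", conn, params=(12345,)
)
conn.close()

# File-plus-pandas approach: the whole file must be loaded first,
# then the row is found in memory.
df = pd.read_csv("data.csv")
row_from_file = df[df["user_id"] == 12345]
```

Is the second approach bound to be much slower once the table grows to hundreds of millions of rows?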