14

I am reading a large CSV that has around 10 million rows and 20 different columns (with header names).

The columns contain numeric values, two columns with dates, and some strings.

Currently it takes me around 1.5 minutes to load the data with something like this:

df = pd.read_csv('data.csv', index_col='date', parse_dates=['date'])

I want to ask: how can I make this significantly faster while still ending up with the same DataFrame after reading the data?

I tried using an HDF5 store, but it was just as slow.

Here is a subset of the data I am trying to read (I chose 8 of the 20 columns and 3 rows out of the couple million):

Date       Comp        Rating  Price  Estprice  Dividend?  Date_earnings  Returns
3/12/2017  Apple       Buy     100    114       Yes        4/4/2017       0.005646835
3/12/2017  Blackberry  Sell    120    97        No         4/25/2017      0.000775331
3/12/2017  Microsoft   Hold    140    100       Yes        5/28/2017      0.003028423

Thanks for the advice.

  • reading data from HDF5 should be significantly faster. Can you provide a sample data set? – MaxU - stand with Ukraine Mar 12 '17 at 15:18
  • Problem with HDF5: as I have a df with a significant number of strings, dates and numbers, it was not working fast (I used the 'fixed' setting and got some warnings about the columns which had string representation). Maybe I am missing a trick. – MysterioProgrammer91 Mar 12 '17 at 15:20
  • don't parse the dates like this / you have a somewhat non-standard format; rather parse them after with an explicit format in to_datetime (see the sketch below these comments) – Jeff Mar 12 '17 at 17:13
  • Does this answer your question? ["Large data" workflows using pandas](https://stackoverflow.com/questions/14262433/large-data-workflows-using-pandas) – notfancy Apr 14 '22 at 06:21
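
A minimal sketch of Jeff's suggestion above, assuming the column names and the M/D/YYYY format from the sample data (read without date parsing, then convert with an explicit format):

import pandas as pd

df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
df['Date_earnings'] = pd.to_datetime(df['Date_earnings'], format='%m/%d/%Y')
df = df.set_index('Date')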

3 Answers

11

Let's test it!

Data generation:

import numpy as np
import pandas as pd

sz = 10**3

df = pd.DataFrame(np.random.randint(0, 10**6, (sz, 2)), columns=['i1','i2'])
df['date'] = pd.date_range('2000-01-01', freq='1S', periods=len(df))
df['dt2'] = pd.date_range('1980-01-01', freq='999S', periods=len(df))
df['f1'] = np.random.rand(len(df))
df['f2'] = np.random.rand(len(df))
# generate 10 string columns
for i in range(1, 11):
    df['s{}'.format(i)] = pd.util.testing.rands_array(10, len(df))

df = pd.concat([df] * 10**3, ignore_index=True).sample(frac=1)
df = df.set_index(df.pop('date').sort_values())

We have generated the following DataFrame:

In [59]: df
Out[59]:
                         i1      i2                 dt2        f1     ...              s7          s8          s9         s10
date                                                                  ...
2000-01-01 00:00:00  216625    4179 1980-01-04 04:35:24  0.679989     ...      7G8rLnoocA  E7Ot7oPsJ6  puQamLn0I2  zxHrATQn0m
2000-01-01 00:00:00  374740  967991 1980-01-09 11:07:48  0.202064     ...      wLETO2g8uL  MhtzNLPXCH  PW1uKxY0df  wTakdCe6nK
2000-01-01 00:00:00  152181  627451 1980-01-10 11:49:39  0.956117     ...      mXOsfUPqOy  6IIst7UFDT  nL6XZxrT3r  BxPCFNdZTK
2000-01-01 00:00:00  915732  730737 1980-01-06 10:25:30  0.854145     ...      Crh94m085p  M1tbrorxGT  XWSKk3b8Pv  M9FWQtPzaa
2000-01-01 00:00:00  590262  248378 1980-01-06 11:48:45  0.307373     ...      wRnMPxeopd  JF24uTUwJC  2CRrs9yB2N  hxYrXFnT1H
2000-01-01 00:00:00  161183  620876 1980-01-08 21:48:36  0.207536     ...      cyN0AExPO2  POaldI6Y0l  TDc13rPdT0  xgoDOW8Y1L
2000-01-01 00:00:00  589696  784856 1980-01-12 02:07:21  0.909340     ...      GIRAAVBRpj  xwcnpwFohz  wqcoTMjQ4S  GTcIWXElo7
...                     ...     ...                 ...       ...     ...             ...         ...         ...         ...
2000-01-01 00:16:39  773606  205714 1980-01-12 07:40:21  0.895944     ...      HEkXfD7pku  1ogy12wBom  OT3KmQRFGz  Dp1cK5R4Gq
2000-01-01 00:16:39  915732  730737 1980-01-06 10:25:30  0.854145     ...      Crh94m085p  M1tbrorxGT  XWSKk3b8Pv  M9FWQtPzaa
2000-01-01 00:16:39  990722  567886 1980-01-03 05:50:06  0.676511     ...      gVO3g0I97R  yCqOhTVeEi  imCCeQa0WG  9tslOJGWDJ
2000-01-01 00:16:39  531778  438944 1980-01-04 20:07:48  0.190714     ...      rbLmkbnO5G  ATm3BpWLC0  moLkyY2Msc  7A2UJERrBG
2000-01-01 00:16:39  880791  245911 1980-01-02 15:57:36  0.014967     ...      bZuKNBvrEF  K84u9HyAmG  4yy2bsUVNn  WZQ5Vvl9zD
2000-01-01 00:16:39  239866  425516 1980-01-10 05:26:42  0.667183     ...      6xukg6TVah  VEUz4d92B8  zHDxty6U3d  ItztnI5LmJ
2000-01-01 00:16:39  338368  804695 1980-01-12 05:27:09  0.084818     ...      NM4fdjKBuW  LXGUbLIuw9  SHdpnttX6q  4oXKMsaOJ5

[1000000 rows x 15 columns]

In [60]: df.shape
Out[60]: (1000000, 15)

In [61]: df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1000000 entries, 2000-01-01 00:00:00 to 2000-01-01 00:16:39
Data columns (total 15 columns):
i1     1000000 non-null int32
i2     1000000 non-null int32
dt2    1000000 non-null datetime64[ns]
f1     1000000 non-null float64
f2     1000000 non-null float64
s1     1000000 non-null object
s2     1000000 non-null object
s3     1000000 non-null object
s4     1000000 non-null object
s5     1000000 non-null object
s6     1000000 non-null object
s7     1000000 non-null object
s8     1000000 non-null object
s9     1000000 non-null object
s10    1000000 non-null object
dtypes: datetime64[ns](1), float64(2), int32(2), object(10)
memory usage: 114.4+ MB

Let's write it to disk in different formats (CSV, HDF5 fixed, HDF5 table, Feather):

# CSV
df.to_csv('c:/tmp/test.csv')
# HDF5 table format
df.to_hdf('c:/tmp/test.h5', 'test', format='t')
#  HDF5 fixed format
df.to_hdf('c:/tmp/test_fix.h5', 'test')
# Feather format
import feather
feather.write_dataframe(df, 'c:/tmp/test.feather')

Timing:

Now we can measure reading from disk:

In [54]: # CSV
    ...: %timeit pd.read_csv('c:/tmp/test.csv', parse_dates=['date', 'dt2'], index_col=0)
1 loop, best of 3: 12.3 s per loop   # 3rd place

In [55]: # HDF5 fixed format
    ...: %timeit pd.read_hdf('c:/tmp/test_fix.h5', 'test')
1 loop, best of 3: 1.85 s per loop   # 1st place

In [56]: # HDF5 table format
    ...: %timeit pd.read_hdf('c:/tmp/test.h5', 'test')
1 loop, best of 3: 24.2 s per loop   # 4th place

In [57]: # Feather
    ...: %timeit feather.read_dataframe('c:/tmp/test.feather')
1 loop, best of 3: 3.21 s per loop   # 2nd place

If you don't always need to read all of the data, it makes sense to store it in HDF5 table format (and make use of the data_columns parameter to index the columns that will be used for filtering).
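
A minimal sketch of that idea, reusing the file path and the generated df from the benchmark above (the choice of indexed columns and the filter condition are just illustrative):

# store in HDF5 table format and index selected columns so reads can filter on them
df.to_hdf('c:/tmp/test.h5', 'test', format='t', data_columns=['i1', 'dt2'])

# later: read back only the matching rows instead of the whole file
subset = pd.read_hdf('c:/tmp/test.h5', 'test', where='i1 > 500000')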

MaxU - stand with Ukraine
  • would be nice to update this benchmark in the docs: http://pandas.pydata.org/pandas-docs/stable/io.html#io-perf (more formats and more dtypes) - PRs welcome! – Jeff Mar 12 '17 at 17:07
  • @MaxU Oops :) I read the timing block as another writing test^^ – Tweakimp Jul 11 '18 at 20:02
4

A common approach I usually take when handling large datasets (~4-10 million rows, 15-30 columns) with pandas is to save the dataframes to .pkl files for future operations. They do take up more space (sometimes as much as 2x in file size), but they cut my load times in a Jupyter notebook from 10-50 seconds with csv to about 1-5 seconds with pkl.
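
A minimal sketch of that workflow (the file names match the timings below; the slow csv read is paid only once):

import pandas as pd

df = pd.read_csv('./testFile.csv')     # slow, one-time read from the csv
df.to_pickle('./testFile.pkl')         # save a pickle for later sessions

df = pd.read_pickle('./testFile.pkl')  # fast reload next time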

In [1]: %%time
        dfTest = pd.read_pickle('./testFile.pkl')
        print(dfTest.shape)
Out[1]: (10820089, 23)
        Wall time: 1.89 s

In [2]: %%time
        dfTest = pd.read_csv('./testFile.csv')
        print(dfTest.shape)
Out[2]: (10820089, 23)
        Wall time: 18.9 s

See the test file size differences used in this test here.

Extra tip: after I'm done performing operations on the dataset, I usually just write the dataframe back out to a csv, since it's smaller for archiving with my projects.

Ryan Oz
1

Depending on what you're using the data for, you might benefit from the "chunksize" argument in read_csv, which you can find in the docs.
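
For example, a minimal sketch of chunked reading (the chunk size and the 'Dividend?' filter, borrowed from the question's sample columns, are just illustrative):

import pandas as pd

pieces = []
for chunk in pd.read_csv('data.csv', chunksize=10**6):
    # work on each piece without holding the whole file in memory,
    # e.g. keep only the rows you actually need
    pieces.append(chunk[chunk['Dividend?'] == 'Yes'])

df = pd.concat(pieces, ignore_index=True)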

If you only need to do analysis on the data (so you only need to load it once) and you aren't already using an interactive environment like Jupyter, definitely try it out! You can load the data once and keep it in memory as you try different operations.

I think you'll find any other advice you need in this thread.

jfbeltran