2

I understand that one reason pandas can be relatively slow at importing CSV files is that it needs to scan the entire content of a column before guessing its type (see the discussions around the mostly deprecated low_memory option of pandas.read_csv). Is my understanding correct?

If it is, what would be a good format for storing a dataframe that explicitly specifies data types, so that pandas doesn't have to guess (SQL is not an option for now)?

Any option in particular from those listed here?

My dataframes have floats, integers, dates, strings and Y/N, so formats supporting numeric values only won't do.

halfer
Pythonista anonymous

3 Answers

2

One option is to use numpy.genfromtxt with delimiter=',' and names=True, then initialize the pandas dataframe from the resulting numpy array. The array will be structured, and the pandas constructor should automatically set the column names from its field names.

In my experience this performs well.
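A minimal sketch of that approach, using a small in-memory CSV instead of a real file. The sample data and column names are made up for illustration, and dtype=None and encoding='utf-8' are my additions so that genfromtxt infers a type per column rather than parsing everything as float:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample data standing in for a CSV file on disk.
csv_text = "id,price,label\n1,2.5,Y\n2,3.75,N\n"

# names=True takes the field names from the header row;
# dtype=None asks numpy to infer a type for each column.
arr = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=",",
    names=True,
    dtype=None,
    encoding="utf-8",
)

# The structured array's field names become the dataframe's column names.
df = pd.DataFrame(arr)
print(df.dtypes)
```

Note that numpy still infers the types here, so this does not remove the guessing step; it only moves it from pandas to numpy.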

shayaan
1

You can improve the efficiency of importing from a CSV file by specifying column names and their datatypes in your call to pandas.read_csv. If the file already has column headers, you probably don't have to specify the names and can just use those, but I like to skip the header and specify names for completeness:

import numpy as np
import pandas as pd

fname = 'data.csv'  # hypothetical path; point this at your own file
col_names = ['a', 'b', 'whatever', 'your', 'names', 'are']
col_types = {k: np.int32 for k in col_names}  # default every column to int32
col_types['a'] = 'object'  # override whichever columns you like
df = pd.read_csv(fname,
                 header=None,   # since we are specifying our own names
                 skiprows=[0],  # if you *do* have a header row, skip it
                 names=col_names,
                 dtype=col_types)

For me, on a large sample dataset comprising mostly integer columns, this was about 20% faster than specifying dtype='object' in the call to pd.read_csv.
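For the mix of types mentioned in the question (floats, integers, dates, strings and Y/N), the same idea might look roughly like this. It is only a sketch: the sample data and column names are invented, and it relies on the parse_dates, true_values and false_values parameters of read_csv to handle the date and Y/N columns:

```python
import io

import pandas as pd

# Hypothetical sample data; in practice this would be a file path.
csv_text = (
    "id,price,when,name,active\n"
    "1,2.5,2020-01-31,alpha,Y\n"
    "2,3.75,2020-02-29,beta,N\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    # Explicit types for the integer, float and string columns.
    dtype={"id": "int32", "price": "float64", "name": "object"},
    parse_dates=["when"],  # parse this column as datetimes
    true_values=["Y"],     # map Y/N to booleans
    false_values=["N"],
)
print(df.dtypes)
```

The file keeps its existing header row here, so names, header=None and skiprows are not needed.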

Engineero
1

I would consider either the HDF5 format or the Feather format. Both are fast (Feather might be faster, but HDF5 is more feature-rich; for example, it supports reading from disk by index), and both store the column types, so pandas doesn't have to guess dtypes or convert data (for example, strings to numbers or strings to datetimes) when loading.
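As a rough illustration of the Feather option, the sketch below writes a dataframe with DataFrame.to_feather and reads it back with pandas.read_feather. It assumes the optional pyarrow package is installed; the helper name feather_roundtrip and the sample data are hypothetical:

```python
import os
import tempfile

import pandas as pd


def feather_roundtrip(df):
    """Write df to a temporary Feather file and read it back.

    Requires the optional pyarrow dependency.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "sample.feather")
        df.to_feather(path)           # column types are stored in the file
        return pd.read_feather(path)  # no type guessing on load


# Hypothetical sample covering ints, floats, dates, strings and booleans.
sample = pd.DataFrame({
    "n": [1, 2],
    "x": [2.5, 3.75],
    "when": pd.to_datetime(["2020-01-31", "2020-02-29"]),
    "name": ["alpha", "beta"],
    "flag": [True, False],
})
```

Calling feather_roundtrip(sample) should return a dataframe whose numeric, datetime and boolean columns have the same dtypes as the original. The HDF5 equivalent uses DataFrame.to_hdf and pandas.read_hdf, which need the PyTables package instead.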

Here are some speed comparisons:

MaxU - stand with Ukraine