
How can I use pd.read_csv() to iteratively chunk through a file and retain the dtype and other meta-information as if I read in the entire dataset at once?

I need to read in a dataset that is too large to fit into memory. I would like to import the file using pd.read_csv and then immediately append the chunk into an HDFStore. However, the data type inference knows nothing about subsequent chunks.

If the first chunk stored in the table contains only int columns and a subsequent chunk contains a float, an exception will be raised. So I need to first iterate through the entire file using read_csv and retain the highest inferred type for each column. In addition, for object columns I need to retain the maximum string length, as these will be stored as strings in the table.

Is there a pandonic way of retaining only this information without reading in the entire dataset?
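
For concreteness, a minimal sketch of the kind of append that fails (the store and table names here are just placeholders):

import pandas as pd

store = pd.HDFStore('store.h5')
for chunk in pd.read_csv('DATASET.csv', chunksize=500):
    # If the first chunk inferred int64 for a column and a later chunk
    # infers float64 (e.g. because of a missing value), this append
    # raises an exception about incompatible column types.
    store.append('data', chunk, data_columns=True)
store.close()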

Zelazny7

1 Answer


I didn't think it would be this intuitive, otherwise I wouldn't have posted the question. But once again, pandas makes things a breeze. I'm keeping the question up, though, as this information might be useful to others working with large data:

In [1]: chunker = pd.read_csv('DATASET.csv', chunksize=500, header=0)

# Store the dtypes of each chunk into a list and convert it to a dataframe:

In [2]: dtypes = pd.DataFrame([chunk.dtypes for chunk in chunker])

In [3]: dtypes.values[:5]
Out[3]:
array([[int64, int64, int64, object, int64, int64, int64, int64],
       [int64, int64, int64, int64, int64, int64, int64, int64],
       [int64, int64, int64, int64, int64, int64, int64, int64],
       [int64, int64, int64, int64, int64, int64, int64, int64],
       [int64, int64, int64, int64, int64, int64, int64, int64]], dtype=object)

# Very cool that I can take the max of these data types and it will preserve the hierarchy:

In [4]: dtypes.max().values
Out[4]: array([int64, int64, int64, object, int64, int64, int64, int64], dtype=object)

# I can now store the above into a dictionary:

types = dtypes.max().to_dict()

# And pass it into pd.read_csv for the second run:

chunker = pd.read_csv('DATASET.csv', dtype=types, chunksize=500)
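
To close the loop with the HDFStore append from the question, an untested sketch of the full two-pass workflow might look like the following (the str_lens bookkeeping for min_itemsize and the store/table names are placeholders of my own, not part of the recipe above):

import pandas as pd

# First pass: collect per-chunk dtypes plus the longest string in each
# object column (needed for min_itemsize when storing as strings).
dtypes, str_lens = [], {}
for chunk in pd.read_csv('DATASET.csv', chunksize=500, header=0):
    dtypes.append(chunk.dtypes)
    for col in chunk.select_dtypes(include='object'):
        longest = chunk[col].astype(str).str.len().max()
        str_lens[col] = max(str_lens.get(col, 0), int(longest))

types = pd.DataFrame(dtypes).max().to_dict()

# Second pass: read with the fixed dtypes and append each chunk.
with pd.HDFStore('store.h5') as store:
    for chunk in pd.read_csv('DATASET.csv', dtype=types, chunksize=500):
        store.append('data', chunk, min_itemsize=str_lens or None,
                     data_columns=True)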
Zelazny7
  • You could also use skiprows= with a list of rows to skip, and have it skip every row in 1–9 so you are sampling only every 10th row. That would be much faster (and probably get the answer you want); I think you will need to generate this list of skipped rows yourself. (See the sketch after these comments for a callable-based variant that avoids generating the list.) – Jeff Mar 21 '13 at 20:09
  • @Jeff How would you get this without knowing the total number of rows in the csv, which would entail reading it all in? – Luke Jun 18 '14 at 22:42
  • 5
    You should also consider choosing the max dtypes after each chunk iteration, instead of storing all dtypes and reducing at the end. For example, a csv with 2MM rows, and a chunk size of 500, would result in 400,000 rows in the frame! – jastr Mar 01 '16 at 04:33
  • 1
    A risk of error in the promotion approach before transforming data types into dictionaries, For exemple, using `.max` the code `pd.Series([np.float32().dtype]*5+[np.uint32().dtype]).max()` will incorrectly result in "np.uint32" insted of "np.float32". The approach that is capable of supporting data types respecting promotion rules is replacing `.max` by `.apply(lambda x: np.result_type(*x), axis=0)` – the_RR Dec 22 '22 at 18:54
  • To make a first pass without storing all the datatypes, `from functools import reduce; final_types = reduce( lambda acc, dt: acc.combine(dt, np.result_type), (chunk.dtypes for chunk in pd.read_csv('big_file.csv', chunksize=10000)) )`. – arielCo Aug 29 '23 at 05:23
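
Regarding the skiprows sampling idea in the first comment: newer pandas versions (than those comments) also accept a callable for skiprows, so the list of rows to skip doesn't have to be generated up front and the total row count doesn't need to be known. A rough sketch:

import pandas as pd

# Keep the header (row 0) and roughly every 10th data row; the callable
# receives each row index and returns True for rows that should be skipped.
sample = pd.read_csv('DATASET.csv', skiprows=lambda i: i > 0 and i % 10 != 0)
types = sample.dtypes.to_dict()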