
I have a 4 GB CSV file of strictly integer data that I want to read into a pandas DataFrame. Plain read_csv consumes all RAM (64 GB) and fails with a MemoryError. With an explicit dtype, it just takes forever (I tried both int and float types).

So, I wrote my own reader:

import numpy as np
import pandas as pd

def read_csv(fname):
    import csv
    reader = csv.reader(open(fname))
    names = next(reader)[1:]  # header row: the first column holds the row tags
    dftype = np.float32
    df = pd.DataFrame(0, dtype=dftype, columns=names, index=names)
    for row in reader:
        tag = row[0]
        df.loc[tag] = np.array(row[1:], dtype=dftype)  # assign one row at a time
    return df

Problem: the line df.loc[tag] = np.array(row[1:], dtype=dftype) is ~1000 times slower when dftype is np.int32 (~20 seconds per line), so I ended up using np.float64 and returning df.astype(np.int32) (~4 minutes total). I also tried converting the values in Python ([int(v) for v in row[1:]] and the float equivalent) with the same result.
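To isolate the effect, here is a minimal standalone sketch (the shape and tag names are made up, not my real data) that times just the per-row assignment:

import time

import numpy as np
import pandas as pd

names = ['t%d' % i for i in range(5000)]  # made-up tag names

def time_row_assign(dftype):
    # square frame indexed by tag, as in my reader above
    df = pd.DataFrame(0, dtype=dftype, columns=names, index=names)
    values = np.arange(len(names), dtype=dftype)
    start = time.time()
    df.loc[names[0]] = values  # a single row-wise assignment, same as the slow line
    return time.time() - start

print('float32:', time_row_assign(np.float32))
print('int32:  ', time_row_assign(np.int32))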

Why could it be so?

UPD: I see the same behavior on both Python 2.7 and 3.5

Marat

2 Answers


UPDATE: my notebook has 16 GB of RAM, so I'll test with a DataFrame 4 times smaller (64 GB / 16 GB = 4):

Setup:

In [1]: df = pd.DataFrame(np.random.randint(0, 10*6, (12000, 47395)), dtype=np.int32)

In [2]: df.shape
Out[2]: (12000, 47395)

In [3]: %timeit -n 1 -r 1 df.to_csv('c:/tmp/big.csv', chunksize=1000)
1 loop, best of 1: 5min 34s per loop

Let's also save this DF in Feather format:

In [4]: import feather

In [6]: df = df.copy()

In [7]: %timeit -n 1 -r 1 feather.write_dataframe(df, 'c:/tmp/big.feather')
1 loop, best of 1: 8.41 s per loop  # yay, it's a bit faster...

In [8]: df.shape
Out[8]: (12000, 47395)

In [9]: del df

and read it back:

In [10]: %timeit -n 1 -r 1 df = feather.read_dataframe('c:/tmp/big.feather')
1 loop, best of 1: 17.4 s per loop  # reading is reasonably fast as well

Reading from the CSV file in chunks is much slower, but it still doesn't give me a MemoryError:

In [2]: %%timeit -n 1 -r 1
   ...: df = pd.DataFrame()
   ...: for chunk in pd.read_csv('c:/tmp/big.csv', index_col=0, chunksize=1000):
   ...:     df = pd.concat([df, chunk])
   ...:     print(df.shape)
   ...: print(df.dtypes.unique())
   ...:
(1000, 47395)
(2000, 47395)
(3000, 47395)
(4000, 47395)
(5000, 47395)
(6000, 47395)
(7000, 47395)
(8000, 47395)
(9000, 47395)
(10000, 47395)
(11000, 47395)
(12000, 47395)
[dtype('int64')]
1 loop, best of 1: 9min 25s per loop

now let's specify dtype=np.int32 explicitly:

In [1]: %%timeit -n 1 -r 1
   ...: df = pd.DataFrame()
   ...: for chunk in pd.read_csv('c:/tmp/big.csv', index_col=0, chunksize=1000, dtype=np.int32):
   ...:     df = pd.concat([df, chunk])
   ...:     print(df.shape)
   ...: print(df.dtypes.unique())
   ...:
(1000, 47395)
(2000, 47395)
(3000, 47395)
(4000, 47395)
(5000, 47395)
(6000, 47395)
(7000, 47395)
(8000, 47395)
(9000, 47395)
(10000, 47395)
(11000, 47395)
(12000, 47395)
[dtype('int32')]
1 loop, best of 1: 10min 38s per loop
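Note that growing the DataFrame with pd.concat inside the loop re-copies everything accumulated so far on every iteration. A cheaper pattern (just a sketch, untested at this size) is to collect the chunks in a list and concatenate once at the end:

chunks = []
for chunk in pd.read_csv('c:/tmp/big.csv', index_col=0, chunksize=1000, dtype=np.int32):
    chunks.append(chunk)
df = pd.concat(chunks)  # one concatenation instead of one per chunk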

Testing HDF5 storage:

In [10]: %timeit -n 1 -r 1 df.to_hdf('c:/tmp/big.h5', 'test')
1 loop, best of 1: 22.5 s per loop

In [11]: del df

In [12]: %timeit -n 1 -r 1 df = pd.read_hdf('c:/tmp/big.h5', 'test')
1 loop, best of 1: 1.04 s per loop

Conclusion:

If you have a chance to change your storage file format, then by all means don't use CSV files; use the HDF5 (.h5) or Feather format instead.
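In practice that means paying the slow CSV parse once and reading the binary file from then on; a rough sketch of such a one-time conversion (the paths are just placeholders):

# one-time conversion: parse the CSV once (slow), then persist it as HDF5
chunks = pd.read_csv('c:/tmp/big.csv', index_col=0, chunksize=1000, dtype=np.int32)
df = pd.concat(list(chunks))
df.to_hdf('c:/tmp/big.h5', 'test')

# every subsequent load is a fast binary read
df = pd.read_hdf('c:/tmp/big.h5', 'test')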

OLD answer:

I would simply use the native Pandas read_csv() method:

chunksize = 10**6
reader = pd.read_csv(filename, index_col=0, chunksize=chunksize)
df = pd.concat([chunk for chunk in reader])  # concatenate the chunks, keeping the index from index_col=0

From your code:

tag = row[0]

df.loc[tag] = np.array(row[1:], dtype=dftype)

It looks like you want to use the first column in your CSV file as an index, hence: index_col=0

MaxU - stand with Ukraine
  • Quote: read_csv consumes all RAM (64 GB) and fails with MemoryError – Marat Jan 13 '17 at 22:21
  • @Marat are you using `chunksize`? – piRSquared Jan 13 '17 at 22:22
  • @MaxU, yes, I tried everything including chunksize, memory-mapped IO and the low_memory settings – Marat Jan 13 '17 at 22:24
  • @MaxU, I appreciate your suggestions, but I would like to point out that I'm not asking how to read a CSV, since I already have a reasonably fast working solution. My question is rather theoretical; it is about conversion speed – Marat Jan 13 '17 at 22:26
  • @Marat, I have updated the answer; please try it out – MaxU - stand with Ukraine Jan 13 '17 at 22:31
  • @Marat, I'd try to stick to the pandas IO methods; they are way faster than naive loops – MaxU - stand with Ukraine Jan 13 '17 at 22:32
  • The memory footprint is not caused by file I/O. Pandas tries to figure out each column's type from its content and keeps a lot of intermediate data while doing so. Chunksize does not improve the situation; it still consumes all the memory. In the question I mentioned that this can be fixed with an explicit dtype, but it still takes too long (it didn't finish in 24 hours). Again, this question is about type conversion, not how to read a CSV – Marat Jan 13 '17 at 22:32
  • @MaxU I have proof that the native pandas methods are in fact less efficient sometimes. My reader completes the task in 3 minutes versus at least 24 hours for the native pandas method. – Marat Jan 13 '17 at 22:34
  • @Marat, did you try to use `dtype=np.int32` as a parameter for `pd.read_csv()`? – MaxU - stand with Ukraine Jan 13 '17 at 22:35
  • @MaxU, yes, and I mentioned that in the question: With explicit dtype, it just takes forever – Marat Jan 13 '17 at 22:37
  • @MaxU, thank you for the suggestions. In the end, it looks like the problem is in the type check during row assignment. I consider this a bug / deficiency in pandas – Marat Jan 15 '17 at 18:48

I suggest you use a NumPy array for this, for example:

import numpy as np

def read_csv(fname):
    import csv
    reader = csv.reader(open(fname))
    names = next(reader)[1:]  # header row: the first column holds the row tags
    n = len(names)
    data = np.empty((n, n), np.int32)  # square matrix: one row/column per tag
    tag_map = {name: i for i, name in enumerate(names)}
    for row in reader:
        tag = row[0]
        data[tag_map[tag], :] = row[1:]  # numpy converts the strings to int32 directly
    return names, data
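If you still want a DataFrame at the end, you can build it once from the returned pair (a sketch; 'big.csv' is a placeholder path):

import pandas as pd

names, data = read_csv('big.csv')
df = pd.DataFrame(data, index=names, columns=names)  # single construction, no per-row assignment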

I don't know why int32 is slower than float32 here, but a DataFrame stores its data column-wise, so setting an element of every column via df.loc[tag] = ... is slow.
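A tiny sketch contrasting the two assignment styles on a toy frame, in case you want to time them yourself (the sizes are arbitrary):

import numpy as np
import pandas as pd

names = ['t%d' % i for i in range(3000)]
df = pd.DataFrame(0, dtype=np.int32, index=names, columns=names)
values = np.arange(len(names), dtype=np.int32)

df.loc[names[0]] = values  # row-wise: writes one element into every column
df[names[0]] = values      # column-wise: replaces a single column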

If you want label-based access, you can use xarray:

import xarray
d = xarray.DataArray(data, [("r", names), ("c", names)])
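Label-based lookups then work much like with a DataFrame, e.g. `d.loc["tag1", "tag2"]` or `d.sel(r="tag1")` (where "tag1" and "tag2" stand for actual entries of names).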
HYRY
  • thank you, that was exactly the case. Column-wise assignment solves the problem: `df[tag] = np.array(row[1:], dtype=dftype)` – Marat Jan 15 '17 at 18:51
  • Still, even with column-wise assignment, float is 3 times faster than int – Marat Jan 15 '17 at 19:18