
I have a 4 GB CSV file of strictly integer data that I want to read into a pandas DataFrame. Plain read_csv consumes all RAM (64 GB) and fails with a MemoryError. With an explicit dtype, it just takes forever (I tried both int and float types).

So, I wrote my own reader:

import numpy as np
import pandas as pd

def read_csv(fname):
    import csv
    reader = csv.reader(open(fname))
    names = next(reader)[1:]  # header row: the first column holds the row tags
    dftype = np.float32
    df = pd.DataFrame(0, dtype=dftype, columns=names, index=names)
    for row in reader:
        tag = row[0]
        df.loc[tag] = np.array(row[1:], dtype=dftype)  # assign one row at a time
    return df

Problem: the line df.loc[tag] = np.array(row[1:], dtype=dftype) is ~1000 times slower when dftype is np.int32 (~20 seconds per line), so I ended up using np.float64 and returning df.astype(np.int32) (~4 minutes total). I also tried converting the values in Python ([int(v) for v in row[1:]] and the float equivalent) with the same result.
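To isolate the effect, here is a minimal standalone sketch (the shape and tag names are made up, not my real data) that times just the per-row assignment:

import time

import numpy as np
import pandas as pd

names = ['t%d' % i for i in range(5000)]  # made-up tag names

def time_row_assign(dftype):
    # square frame indexed by tag, as in my reader above
    df = pd.DataFrame(0, dtype=dftype, columns=names, index=names)
    values = np.arange(len(names), dtype=dftype)
    start = time.time()
    df.loc[names[0]] = values  # a single row-wise assignment, same as the slow line
    return time.time() - start

print('float32:', time_row_assign(np.float32))
print('int32:  ', time_row_assign(np.int32))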

Why could it be so?

UPD: I see the same behavior on both Python 2.7 and 3.5

Marat

2 Answers


UPDATE: my notebook has 16 GB of RAM, so I'll test with a DataFrame 4 times smaller (64 GB / 16 GB = 4):

Setup:

In [1]: df = pd.DataFrame(np.random.randint(0, 10*6, (12000, 47395)), dtype=np.int32)

In [2]: df.shape
Out[2]: (12000, 47395)

In [3]: %timeit -n 1 -r 1 df.to_csv('c:/tmp/big.csv', chunksize=1000)
1 loop, best of 1: 5min 34s per loop

Let's also save this DF in Feather format:

In [4]: import feather

In [6]: df = df.copy()

In [7]: %timeit -n 1 -r 1 feather.write_dataframe(df, 'c:/tmp/big.feather')
1 loop, best of 1: 8.41 s per loop  # yay, it's a bit faster...

In [8]: df.shape
Out[8]: (12000, 47395)

In [9]: del df

and read it back:

In [10]: %timeit -n 1 -r 1 df = feather.read_dataframe('c:/tmp/big.feather')
1 loop, best of 1: 17.4 s per loop  # reading is reasonably fast as well

Reading from the CSV file in chunks is much slower, but it still doesn't give me a MemoryError:

In [2]: %%timeit -n 1 -r 1
   ...: df = pd.DataFrame()
   ...: for chunk in pd.read_csv('c:/tmp/big.csv', index_col=0, chunksize=1000):
   ...:     df = pd.concat([df, chunk])
   ...:     print(df.shape)
   ...: print(df.dtypes.unique())
   ...:
(1000, 47395)
(2000, 47395)
(3000, 47395)
(4000, 47395)
(5000, 47395)
(6000, 47395)
(7000, 47395)
(8000, 47395)
(9000, 47395)
(10000, 47395)
(11000, 47395)
(12000, 47395)
[dtype('int64')]
1 loop, best of 1: 9min 25s per loop

now let's specify dtype=np.int32 explicitly:

In [1]: %%timeit -n 1 -r 1
   ...: df = pd.DataFrame()
   ...: for chunk in pd.read_csv('c:/tmp/big.csv', index_col=0, chunksize=1000, dtype=np.int32):
   ...:     df = pd.concat([df, chunk])
   ...:     print(df.shape)
   ...: print(df.dtypes.unique())
   ...:
(1000, 47395)
(2000, 47395)
(3000, 47395)
(4000, 47395)
(5000, 47395)
(6000, 47395)
(7000, 47395)
(8000, 47395)
(9000, 47395)
(10000, 47395)
(11000, 47395)
(12000, 47395)
[dtype('int32')]
1 loop, best of 1: 10min 38s per loop
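Note that growing the DataFrame with pd.concat inside the loop re-copies everything accumulated so far on every iteration. A cheaper pattern (just a sketch, untested at this size) is to collect the chunks in a list and concatenate once at the end:

chunks = []
for chunk in pd.read_csv('c:/tmp/big.csv', index_col=0, chunksize=1000, dtype=np.int32):
    chunks.append(chunk)
df = pd.concat(chunks)  # one concatenation instead of one per chunk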

Testing HDF5 storage:

In [10]: %timeit -n 1 -r 1 df.to_hdf('c:/tmp/big.h5', 'test')
1 loop, best of 1: 22.5 s per loop

In [11]: del df

In [12]: %timeit -n 1 -r 1 df = pd.read_hdf('c:/tmp/big.h5', 'test')
1 loop, best of 1: 1.04 s per loop

Conclusion:

If you have a chance to change your storage file format, then by all means don't use CSV files; use the HDF5 (.h5) or Feather format instead.
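In practice that means paying the slow CSV parse once and reading the binary file from then on; a rough sketch of such a one-time conversion (the paths are just placeholders):

# one-time conversion: parse the CSV once (slow), then persist it as HDF5
chunks = pd.read_csv('c:/tmp/big.csv', index_col=0, chunksize=1000, dtype=np.int32)
df = pd.concat(list(chunks))
df.to_hdf('c:/tmp/big.h5', 'test')

# every subsequent load is a fast binary read
df = pd.read_hdf('c:/tmp/big.h5', 'test')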

OLD answer:

I would simply use the native Pandas read_csv() method:

chunksize = 10**6
reader = pd.read_csv(filename, index_col=0, chunksize=chunksize)
df = pd.concat([chunk for chunk in reader])  # concatenate the chunks, keeping the index from index_col=0

From your code:

tag = row[0]

df.loc[tag] = np.array(row[1:], dtype=dftype)

It looks like you want to use the first column in your CSV file as an index, hence: index_col=0

MaxU - stand with Ukraine
  • Quote: read_csv consumes all RAM (64 GB) and fails with MemoryError – Marat Jan 13 '17 at 22:21
  • @Marat are you using `chunksize`? – piRSquared Jan 13 '17 at 22:22
  • @MaxU, yes, I tried everything including chunksize, memory-mapped IO and the low_memory settings – Marat Jan 13 '17 at 22:24
  • @MaxU, I appreciate your suggestions, but I would like to point out that I'm not asking how to read a CSV, since I already have a reasonably fast working solution. My question is rather theoretical; it is about conversion speed – Marat Jan 13 '17 at 22:26
  • @Marat, I have updated the answer; please try it out – MaxU - stand with Ukraine Jan 13 '17 at 22:31
  • @Marat, I'd try to stick to the pandas IO methods; they are way faster than naive loops – MaxU - stand with Ukraine Jan 13 '17 at 22:32
  • The memory footprint is not caused by file I/O. Pandas tries to figure out each column's type from its content and keeps a lot of intermediate data while doing so. Chunksize does not improve the situation; it still consumes all the memory. In the question I mentioned that this can be fixed with an explicit dtype, but it still takes too long (it didn't finish in 24 hours). Again, this question is about type conversion, not how to read a CSV – Marat Jan 13 '17 at 22:32
  • @MaxU I have proof that the native pandas methods are in fact less efficient sometimes. My reader completes the task in 3 minutes versus at least 24 hours for the native pandas method. – Marat Jan 13 '17 at 22:34
  • @Marat, did you try to use `dtype=np.int32` as a parameter for `pd.read_csv()`? – MaxU - stand with Ukraine Jan 13 '17 at 22:35
  • @MaxU, yes, and I mentioned that in the question: With explicit dtype, it just takes forever – Marat Jan 13 '17 at 22:37
  • @MaxU, thank you for the suggestions. In the end, it looks like the problem is in the type check during row assignment. I consider this a bug / deficiency in pandas – Marat Jan 15 '17 at 18:48

I suggest you use a NumPy array for this, for example:

import numpy as np

def read_csv(fname):
    import csv
    reader = csv.reader(open(fname))
    names = next(reader)[1:]  # header row: the first column holds the row tags
    n = len(names)
    data = np.empty((n, n), np.int32)  # square matrix: one row/column per tag
    tag_map = {name: i for i, name in enumerate(names)}
    for row in reader:
        tag = row[0]
        data[tag_map[tag], :] = row[1:]  # numpy converts the strings to int32 directly
    return names, data
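If you still want a DataFrame at the end, you can build it once from the returned pair (a sketch; 'big.csv' is a placeholder path):

import pandas as pd

names, data = read_csv('big.csv')
df = pd.DataFrame(data, index=names, columns=names)  # single construction, no per-row assignment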

I don't know why int32 is slower than float32 here, but a DataFrame stores its data column-wise, so setting an element of every column via df.loc[tag] = ... is slow.
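A tiny sketch contrasting the two assignment styles on a toy frame, in case you want to time them yourself (the sizes are arbitrary):

import numpy as np
import pandas as pd

names = ['t%d' % i for i in range(3000)]
df = pd.DataFrame(0, dtype=np.int32, index=names, columns=names)
values = np.arange(len(names), dtype=np.int32)

df.loc[names[0]] = values  # row-wise: writes one element into every column
df[names[0]] = values      # column-wise: replaces a single column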

If you want label-based access, you can use xarray:

import xarray
d = xarray.DataArray(data, [("r", names), ("c", names)])
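Label-based lookups then work much like with a DataFrame, e.g. `d.loc["tag1", "tag2"]` or `d.sel(r="tag1")` (where "tag1" and "tag2" stand for actual entries of names).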
HYRY
  • thank you, that was exactly the case. Column-wise assignment solves the problem: `df[tag] = np.array(row[1:], dtype=dftype)` – Marat Jan 15 '17 at 18:51
  • Still, even with column-wise assignment, float is 3 times faster than int – Marat Jan 15 '17 at 19:18