UPDATE: my notebook has 16GB of RAM, so I'll test it with a 4 times smaller DF (64GB / 16GB = 4):
Setup:
In [1]: df = pd.DataFrame(np.random.randint(0, 10*6, (12000, 47395)), dtype=np.int32)
In [2]: df.shape
Out[2]: (12000, 47395)
In [3]: %timeit -n 1 -r 1 df.to_csv('c:/tmp/big.csv', chunksize=1000)
1 loop, best of 1: 5min 34s per loop
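As a quick sanity check, the data portion of this test DF should occupy roughly 2 GiB in memory (a back-of-the-envelope sketch, ignoring the index and pandas overhead):

import numpy as np

# 12000 rows x 47395 int32 columns, 4 bytes per cell
n_rows, n_cols = 12_000, 47_395
itemsize = np.dtype(np.int32).itemsize                      # 4 bytes
print(f"{n_rows * n_cols * itemsize / 1024**3:.2f} GiB")    # ~2.12 GiB -- fits comfortably in 16GB of RAM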
Let's also save this DF in Feather format:
In [4]: import feather
In [6]: df = df.copy()
In [7]: %timeit -n 1 -r 1 feather.write_dataframe(df, 'c:/tmp/big.feather')
1 loop, best of 1: 8.41 s per loop    # yay, that's ~40x faster than to_csv...
In [8]: df.shape
Out[8]: (12000, 47395)
In [9]: del df
and read it back:
In [10]: %timeit -n 1 -r 1 df = feather.read_dataframe('c:/tmp/big.feather')
1 loop, best of 1: 17.4 s per loop # reading is reasonably fast as well
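Side note: the standalone feather package used above has since been folded into pyarrow, and newer pandas versions expose the format directly; a minimal sketch (assuming pyarrow is installed):

import pandas as pd

# newer pandas talks Feather natively (backed by pyarrow), no separate feather package needed
df.to_feather('c:/tmp/big.feather')           # write the DF from the setup above
df = pd.read_feather('c:/tmp/big.feather')    # read it back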
Reading from the CSV file in chunks is much slower, but it still doesn't give me a MemoryError:
In [2]: %%timeit -n 1 -r 1
...: df = pd.DataFrame()
...: for chunk in pd.read_csv('c:/tmp/big.csv', index_col=0, chunksize=1000):
...: df = pd.concat([df, chunk])
...: print(df.shape)
...: print(df.dtypes.unique())
...:
(1000, 47395)
(2000, 47395)
(3000, 47395)
(4000, 47395)
(5000, 47395)
(6000, 47395)
(7000, 47395)
(8000, 47395)
(9000, 47395)
(10000, 47395)
(11000, 47395)
(12000, 47395)
[dtype('int64')]
1 loop, best of 1: 9min 25s per loop
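A side note: calling pd.concat() inside the loop re-copies the accumulated frame on every iteration; collecting the chunks first and concatenating once avoids that extra copying (same read, untimed sketch):

import pandas as pd

# gather the chunks in a list and concatenate once at the end --
# repeated pd.concat in a loop copies the growing frame each time
chunks = [chunk for chunk in pd.read_csv('c:/tmp/big.csv', index_col=0, chunksize=1000)]
df = pd.concat(chunks)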
Now let's specify dtype=np.int32 explicitly:
In [1]: %%timeit -n 1 -r 1
...: df = pd.DataFrame()
...: for chunk in pd.read_csv('c:/tmp/big.csv', index_col=0, chunksize=1000, dtype=np.int32):
...: df = pd.concat([df, chunk])
...: print(df.shape)
...: print(df.dtypes.unique())
...:
(1000, 47395)
(2000, 47395)
(3000, 47395)
(4000, 47395)
(5000, 47395)
(6000, 47395)
(7000, 47395)
(8000, 47395)
(9000, 47395)
(10000, 47395)
(11000, 47395)
(12000, 47395)
[dtype('int32')]
1 loop, best of 1: 10min 38s per loop
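So dtype=np.int32 doesn't make the parse any faster here, but it halves the memory footprint of the resulting DF, which is what matters when RAM is the bottleneck (rough numbers for this shape, data only):

# 12000 * 47395 cells at 8 bytes (int64) vs 4 bytes (int32)
cells = 12_000 * 47_395
print(f"{cells * 8 / 1024**3:.2f} GiB")   # ~4.24 GiB as int64
print(f"{cells * 4 / 1024**3:.2f} GiB")   # ~2.12 GiB as int32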
Testing HDF Storage:
In [10]: %timeit -n 1 -r 1 df.to_hdf('c:/tmp/big.h5', 'test')
1 loop, best of 1: 22.5 s per loop
In [11]: del df
In [12]: %timeit -n 1 -r 1 df = pd.read_hdf('c:/tmp/big.h5', 'test')
1 loop, best of 1: 1.04 s per loop
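to_hdf() above uses the default 'fixed' format, which is the fastest but write-once; if you also need to append to or query the file on disk, the 'table' format is an option (a sketch, assuming the PyTables package is installed; the parameter choices are just an example):

import pandas as pd

# 'fixed' (default): fastest, but the file can't be appended to or queried
df.to_hdf('c:/tmp/big.h5', key='test', mode='w')

# 'table': slower, but supports appending and on-disk queries; complevel adds compression
df.to_hdf('c:/tmp/big_table.h5', key='test', mode='w', format='table', complevel=5)

df = pd.read_hdf('c:/tmp/big.h5', 'test')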
Conclusion:
If you have a chance to change your storage file format, then by all means avoid CSV files and use HDF5 (.h5) or Feather instead...
OLD answer:
I would simply use the native Pandas read_csv() method:
chunksize = 10**6
reader = pd.read_csv(filename, index_col=0, chunksize=chunksize)
df = pd.concat(chunk for chunk in reader)
From your code:
tag = row[0]
df.loc[tag] = np.array(row[1:], dtype=dftype)
It looks like you want to use the first column in your CSV file as an index, hence: index_col=0
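And if you do get the chance to switch formats as suggested in the update above, a one-time conversion pass is all it takes (a sketch; 'data.csv' and the output paths are placeholders):

import pandas as pd

# one-time conversion: read the CSV in chunks, then persist in a binary format
# so that every later load takes seconds instead of minutes
reader = pd.read_csv('data.csv', index_col=0, chunksize=10**6)
df = pd.concat(chunk for chunk in reader)

df.to_hdf('data.h5', key='data', mode='w')       # HDF5 copy
df.reset_index().to_feather('data.feather')      # Feather requires a default RangeIndex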