I have a dataset with 8 columns and about 5 million rows. The file is larger than 400 MB. I am trying to split the columns into two separate files. The file extension is .dat and the columns are separated by a single space.
Input:
00022d3f5b17 00022d9064bc 1073260801 1073260803 819251 440006 819251 440006
00022d9064bc 00022dba8f51 1073260801 1073260803 819251 440006 819251 440006
00022d9064bc 00022de1c6c1 1073260801 1073260803 819251 440006 819251 440006
00022d9064bc 003065f30f37 1073260801 1073260803 819251 440006 819251 440006
00022d9064bc 00904b48a3b6 1073260801 1073260803 819251 440006 819251 440006
00022d9064bc 00904b83a0ea 1073260803 1073260810 819213 439954 819213 439954
00904b4557d3 00904b85d3cf 1073260803 1073261920 817526 439458 817526 439458
00022de73863 00904b14b494 1073260804 1073265410 817558 439525 817558 439525
Code:
import pandas as pd
df = pd.read_csv('sorted.dat', sep=' ', header=None, names=['id_1', 'id_2', 'time_1', 'time_2', 'gps_1', 'gps_2', 'gps_3', 'gps_4'])
#print df
df.to_csv('output_1.csv', columns = ['id_1', 'time_1', 'time_2', 'gps_1', 'gps_2'])
df.to_csv('output_2.csv', columns = ['id_2', 'time_1', 'time_2', 'gps_3', 'gps_4'])
The output should be one file with col[1], col[3], col[4], col[5], col[6] and another file with col[2], col[3], col[4], col[7], col[8].
I am getting this error:
Traceback (most recent call last):
  File "split_col_pandas.py", line 3, in <module>
    df = pd.read_csv('dartmouthsorted.dat', sep=' ', header=None, names=['id_1', 'id_2', 'time_1', 'time_2', 'gps_1', 'gps_2', 'gps_3', 'gps_4'])
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 325, in _read
    return parser.read()
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 823, in read
    df = DataFrame(col_dict, columns=columns, index=index)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 224, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 360, in _init_dict
    return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 5241, in _arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 3999, in create_block_manager_from_arrays
    blocks = form_blocks(arrays, names, axes)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 4076, in form_blocks
    int_blocks = _multi_blockify(int_items)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 4145, in _multi_blockify
    values, placement = _stack_arrays(list(tup_block), dtype)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 4188, in _stack_arrays
    stacked = np.empty(shape, dtype=dtype)
MemoryError
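
Since the MemoryError is raised while pandas is allocating arrays for the whole file, one possible workaround is to never hold the full dataset in memory. The sketch below is only an illustration under stated assumptions, not a tested fix on the real 400 MB file: it swaps pandas for the standard csv module, assumes Python 3 (the traceback above is from Python 2.7), and assumes every valid row has exactly 8 whitespace-separated fields. The function name split_columns is hypothetical. Unlike df.to_csv with default settings, it does not write an index column.

```python
import csv


def split_columns(src_path, out1_path, out2_path):
    """Stream src_path one line at a time and write two column subsets.

    Only a single row is held in memory at any moment, so memory use
    does not grow with the size of the input file.
    Returns the number of data rows written to each output.
    """
    rows_written = 0
    with open(src_path, newline="") as src, \
            open(out1_path, "w", newline="") as f1, \
            open(out2_path, "w", newline="") as f2:
        writer_1 = csv.writer(f1)
        writer_2 = csv.writer(f2)

        # Header rows, using the column names from the question.
        writer_1.writerow(["id_1", "time_1", "time_2", "gps_1", "gps_2"])
        writer_2.writerow(["id_2", "time_1", "time_2", "gps_3", "gps_4"])

        for line in src:
            fields = line.split()
            if len(fields) != 8:
                # Assumption: blank or malformed lines are skipped silently.
                continue
            id_1, id_2, time_1, time_2, gps_1, gps_2, gps_3, gps_4 = fields
            writer_1.writerow([id_1, time_1, time_2, gps_1, gps_2])
            writer_2.writerow([id_2, time_1, time_2, gps_3, gps_4])
            rows_written += 1
    return rows_written
```

Calling split_columns('sorted.dat', 'output_1.csv', 'output_2.csv') would then produce the two files described above. All values are kept as strings, so the leading zeros in the id columns are preserved.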