
I have a dataset with 8 columns and about 5 million rows; the file is more than 400 MB. I am trying to split the columns into two files. The file extension is .dat and the columns are separated by single spaces.

Input:

00022d3f5b17 00022d9064bc 1073260801 1073260803 819251 440006 819251 440006
00022d9064bc 00022dba8f51 1073260801 1073260803 819251 440006 819251 440006
00022d9064bc 00022de1c6c1 1073260801 1073260803 819251 440006 819251 440006
00022d9064bc 003065f30f37 1073260801 1073260803 819251 440006 819251 440006
00022d9064bc 00904b48a3b6 1073260801 1073260803 819251 440006 819251 440006
00022d9064bc 00904b83a0ea 1073260803 1073260810 819213 439954 819213 439954
00904b4557d3 00904b85d3cf 1073260803 1073261920 817526 439458 817526 439458
00022de73863 00904b14b494 1073260804 1073265410 817558 439525 817558 439525

code:

import pandas as pd 

df = pd.read_csv('sorted.dat', sep=' ', header=None, names=['id_1', 'id_2', 'time_1', 'time_2', 'gps_1', 'gps_2', 'gps_3', 'gps_4'])

#print df

df.to_csv('output_1.csv', columns = ['id_1', 'time_1', 'time_2', 'gps_1', 'gps_2'])

df.to_csv('output_2.csv', columns = ['id_2', 'time_1', 'time_2', 'gps_3', 'gps_4']) 

The desired output is one file with col[1], col[3], col[4], col[5], col[6] and another file with col[2], col[3], col[4], col[7], col[8].

I am getting this error:

Traceback (most recent call last):
  File "split_col_pandas.py", line 3, in <module>
    df = pd.read_csv('dartmouthsorted.dat', sep=' ', header=None, names=['id_1', 'id_2', 'time_1', 'time_2', 'gps_1', 'gps_2', 'gps_3', 'gps_4'])
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 325, in _read
    return parser.read()
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 823, in read
    df = DataFrame(col_dict, columns=columns, index=index)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 224, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 360, in _init_dict
    return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 5241, in _arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 3999, in create_block_manager_from_arrays
    blocks = form_blocks(arrays, names, axes)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 4076, in form_blocks
    int_blocks = _multi_blockify(int_items)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 4145, in _multi_blockify
    values, placement = _stack_arrays(list(tup_block), dtype)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 4188, in _stack_arrays
    stacked = np.empty(shape, dtype=dtype)
MemoryError
Sitz Blogz

2 Answers


Try this:

columns = ['id_1', 'time_1', 'time_2', 'gps_1', 'gps_2']
df[columns].to_csv('output_1.csv')

columns = ['id_2', 'time_1', 'time_2', 'gps_3', 'gps_4']
df[columns].to_csv('output_2.csv')

Also, check out this post about Memory errors in Python: Memory errors and list limits?
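If loading the whole file still runs out of memory, `read_csv` can stream it in pieces via its `chunksize` parameter, so only one chunk is ever resident at a time. A sketch using the question's column layout (the tiny sample file at the top is only there to make the sketch self-contained):

```python
import pandas as pd

# Tiny stand-in for the 400 MB sorted.dat, so the sketch runs as-is.
with open('sorted.dat', 'w') as f:
    f.write('00022d3f5b17 00022d9064bc 1073260801 1073260803 '
            '819251 440006 819251 440006\n')

names = ['id_1', 'id_2', 'time_1', 'time_2', 'gps_1', 'gps_2', 'gps_3', 'gps_4']
cols_1 = ['id_1', 'time_1', 'time_2', 'gps_1', 'gps_2']
cols_2 = ['id_2', 'time_1', 'time_2', 'gps_3', 'gps_4']

# Stream the file 100,000 rows at a time; append each piece to the outputs,
# writing the header only for the first chunk.
reader = pd.read_csv('sorted.dat', sep=' ', header=None, names=names,
                     chunksize=100000)
for i, chunk in enumerate(reader):
    chunk[cols_1].to_csv('output_1.csv', mode='a', header=(i == 0), index=False)
    chunk[cols_2].to_csv('output_2.csv', mode='a', header=(i == 0), index=False)
```

The chunk size is a tunable trade-off between memory use and throughput; 100,000 rows of 8 columns is a modest amount of memory per chunk.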

Update

In the comments, the post author also requested that, after the two new CSV files have been saved, output_1.csv and output_2.csv be recombined so that id_1 and id_2 share a single column, gps_1 and gps_3 become a single column, and gps_2 and gps_4 become a single column.

There are lots of ways to do this, but here's one way (readability was chosen over efficiency):

columns = ['id_merged', 'time_1', 'time_2', 'gps_1or3', 'gps_2or4']
df1 = pd.read_csv('output_1.csv', names=columns, skiprows=1)
df2 = pd.read_csv('output_2.csv', names=columns, skiprows=1)

df = pd.concat([df1, df2])  # your final dataframe

One potential issue with this is that you may end up with null values in some places, which need to be handled appropriately or you'll hit errors; there is also a danger that the new id_merged column will contain duplicate keys, but that's a problem for another question...

For more info on the update, see the docs regarding joins, concats and merges: http://pandas.pydata.org/pandas-docs/stable/merging.html
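The same stacking can also be done in one step from the original `df`, without writing and re-reading the two CSV files. A sketch (the one-row DataFrame is only a stand-in so the example is self-contained):

```python
import pandas as pd

# One-row stand-in for the question's DataFrame.
df = pd.DataFrame([{'id_1': 'a', 'id_2': 'b', 'time_1': 1, 'time_2': 2,
                    'gps_1': 10, 'gps_2': 20, 'gps_3': 30, 'gps_4': 40}])

columns = ['id_merged', 'time_1', 'time_2', 'gps_1or3', 'gps_2or4']
half_1 = df[['id_1', 'time_1', 'time_2', 'gps_1', 'gps_2']].copy()
half_2 = df[['id_2', 'time_1', 'time_2', 'gps_3', 'gps_4']].copy()
half_1.columns = columns
half_2.columns = columns

# Stack the two halves; ignore_index avoids duplicate row labels.
merged = pd.concat([half_1, half_2], ignore_index=True)
```

This skips the intermediate round-trip through disk, which matters at 5 million rows, at the cost of keeping both halves in memory at once.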

Projski
  • Great! Thank you so much! This works just as needed. May I ask for a small addition? The next step is that I want to merge both the output files. I know how to merge them from the terminal, but it would be great to have the final output within the code. – Sitz Blogz May 18 '16 at 00:02
  • Just so that I'm clear, you want to save the items to csv separately, and then merge them back into a dataframe to be worked on further? When you merge them, should the merged result be identical to the original `df` object? Or do you want to stack certain columns, and have `id_1` and `id_2` become `id_merged`, and `gps_1` and `gps_3` be stacked into `gps_a` etc ? – Projski May 18 '16 at 00:04
  • Merge as in: `id_1` & `id_2` in `col[1]`; `time_1` & `time_2` stay in `col[2]` & `col[3]`; and finally `gps_1, gps_2` and `gps_3, gps_4` in `col[4]` & `col[5]`. In simple words, the latter option you mentioned. – Sitz Blogz May 18 '16 at 00:08
  • In other words, the first input file is edge connections; now I am making it node connections. – Sitz Blogz May 18 '16 at 00:10
  • @SitzBlogz if this answered your question, would you mind accepting it so we can mark the question as resolved? – Projski May 18 '16 at 04:37
  • Thank you for the answer; let me execute and check. And yes, there is duplication, no problem with that; duplication in the id is very much needed for my next analysis. – Sitz Blogz May 18 '16 at 09:03
  • I have a query to ask here: can the updated code be used within the previous code? – Sitz Blogz May 18 '16 at 09:40
  • Thank you. I put both the code parts in one and executed it. It took a while, as it loads a large file (1.2 GB) into memory. As you mentioned, there are several duplications, which may call for another question if I am not able to get it done. – Sitz Blogz May 18 '16 at 12:15
  • Yes, both the previous and updated blocks of code are compatible. Glad it worked out! – Projski May 18 '16 at 15:16

This method is very memory-efficient, as it operates on just one row at a time. It also does not require pandas.

import csv

input_file = 'sorted.dat'
output_file_1 = 'output_1.csv'
output_file_2 = 'output_2.csv'
columns_1 = ['id_1', 'time_1', 'time_2', 'gps_1', 'gps_2']
columns_2 = ['id_2', 'time_1', 'time_2', 'gps_3', 'gps_4']

with open(input_file, 'rb') as file_in, \
     open(output_file_1, 'wb') as file_out_1, \
     open(output_file_2, 'wb') as file_out_2:

    # Parse the space-delimited input directly; no per-row split needed.
    reader = csv.reader(file_in, delimiter=' ')
    writer_1 = csv.writer(file_out_1)
    writer_2 = csv.writer(file_out_2)
    writer_1.writerow(columns_1)
    writer_2.writerow(columns_2)
    for line in reader:
        # Indices match the requested split: cols 1,3,4,5,6 and 2,3,4,7,8.
        writer_1.writerow([line[n] for n in [0, 2, 3, 4, 5]])
        writer_2.writerow([line[n] for n in [1, 2, 3, 6, 7]])

!cat output_1.csv
id_1,time_1,time_2,gps_1,gps_2
00022d3f5b17,1073260801,1073260803,819251,440006
00022d9064bc,1073260801,1073260803,819251,440006
00022d9064bc,1073260801,1073260803,819251,440006
00022d9064bc,1073260801,1073260803,819251,440006
00022d9064bc,1073260801,1073260803,819251,440006
00022d9064bc,1073260803,1073260810,819213,439954
00904b4557d3,1073260803,1073261920,817526,439458
00022de73863,1073260804,1073265410,817558,439525

!cat output_2.csv
id_2,time_1,time_2,gps_3,gps_4
00022d9064bc,1073260801,1073260803,819251,440006
00022dba8f51,1073260801,1073260803,819251,440006
00022de1c6c1,1073260801,1073260803,819251,440006
003065f30f37,1073260801,1073260803,819251,440006
00904b48a3b6,1073260801,1073260803,819251,440006
00904b83a0ea,1073260803,1073260810,819213,439954
00904b85d3cf,1073260803,1073261920,817526,439458
00904b14b494,1073260804,1073265410,817558,439525
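Note that the `'rb'`/`'wb'` file modes above are the Python 2 idiom, matching the Python 2.7 traceback in the question. Under Python 3 the `csv` module expects text-mode files opened with `newline=''`. A sketch of the same split for Python 3 (the two-row sample written at the top is only there to make the sketch self-contained; the indices follow the column split the question asks for):

```python
import csv

# Two-row sample standing in for sorted.dat, so this runs as-is.
sample_rows = [
    ['00022d3f5b17', '00022d9064bc', '1073260801', '1073260803',
     '819251', '440006', '819251', '440006'],
    ['00022d9064bc', '00022dba8f51', '1073260801', '1073260803',
     '819251', '440006', '819251', '440006'],
]
with open('sorted.dat', 'w') as f:
    for row in sample_rows:
        f.write(' '.join(row) + '\n')

# Python 3: open in text mode, with newline='' as the csv docs recommend.
with open('sorted.dat', newline='') as file_in, \
     open('output_1.csv', 'w', newline='') as file_out_1, \
     open('output_2.csv', 'w', newline='') as file_out_2:
    reader = csv.reader(file_in, delimiter=' ')
    writer_1 = csv.writer(file_out_1)
    writer_2 = csv.writer(file_out_2)
    writer_1.writerow(['id_1', 'time_1', 'time_2', 'gps_1', 'gps_2'])
    writer_2.writerow(['id_2', 'time_1', 'time_2', 'gps_3', 'gps_4'])
    for line in reader:
        writer_1.writerow([line[n] for n in [0, 2, 3, 4, 5]])
        writer_2.writerow([line[n] for n in [1, 2, 3, 6, 7]])
```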
Alexander