
I generated a very big .csv file, and now it doesn't fit in RAM. So I decided to delete some unneeded columns to reduce the file size. How can I do that?

I tried data = pd.read_csv("file.csv", index_col=0, usecols=["id", "wall"]), but it still doesn't fit in RAM.

File is about 1.5GB, RAM is 8GB.

Shade
Daniil Okhlopkov

2 Answers


Instead of deleting columns, you can also read only specific columns from the CSV file using a DictReader (if you're not using Pandas).

import csv
from io import StringIO

columns = 'AAA,DDD,FFF,GGG'.split(',')


testdata ='''\
AAA,bbb,ccc,DDD,eee,FFF,GGG,hhh
1,2,3,4,50,3,20,4
2,1,3,5,24,2,23,5
4,1,3,6,34,1,22,5
2,1,3,5,24,2,23,5
2,1,3,5,24,2,23,5
'''

reader = csv.DictReader(StringIO(testdata))

# Lazily pull only the desired columns from each row (no full-file load)
desired_cols = (tuple(row[col] for col in columns) for row in reader)

Output:

>>> list(desired_cols)
[('1', '4', '3', '20'),
 ('2', '5', '2', '23'),
 ('4', '6', '1', '22'),
 ('2', '5', '2', '23'),
 ('2', '5', '2', '23')]

Source: https://stackoverflow.com/a/20065131/6633975
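Building on the DictReader approach, the filtered rows can be streamed straight into a new file with csv.DictWriter, so the full file is never held in memory. A minimal sketch using in-memory buffers (the data and column names here are made up for illustration; with a real file, replace the StringIO objects with open() calls):

```python
import csv
from io import StringIO

# Hypothetical test data; for a real file use
# open("in.csv") and open("out.csv", "w", newline="") instead.
src = StringIO("AAA,bbb,DDD\n1,2,3\n4,5,6\n")
dst = StringIO()

columns = ["AAA", "DDD"]  # columns to keep

reader = csv.DictReader(src)
# extrasaction="ignore" silently drops every field not listed in fieldnames
writer = csv.DictWriter(dst, fieldnames=columns, extrasaction="ignore")
writer.writeheader()
for row in reader:
    # One row in memory at a time, so this scales to files larger than RAM.
    writer.writerow(row)

print(dst.getvalue())
```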

Using Pandas:

Here is an example illustrating the answer given by EdChum. There are many additional options for loading a CSV file; check the API reference.

import pandas as pd


raw_data = {'first_name': ['Steve', 'Guido', 'John'],
        'last_name': ['Jobs', 'Van Rossum', "von Neumann"]}
df = pd.DataFrame(raw_data)
# Saving data without a header (the index is still written as column 0)
df.to_csv(path_or_buf='test.csv', header=False)
# Telling pandas there is no header and loading only the first-name column
df = pd.read_csv(filepath_or_buffer='test.csv', header=None, usecols=[1], names=['first_name'])
df

  first_name
0      Steve
1      Guido
2       John
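If even the two kept columns do not fit in RAM, pandas can also process the file piece by piece via the chunksize parameter of read_csv. A sketch on a tiny made-up buffer ("id" and "wall" come from the question; "noise" is a hypothetical column to drop):

```python
import pandas as pd
from io import StringIO

# Made-up stand-in for the large file.
csv_text = "id,wall,noise\n1,a,x\n2,b,y\n3,c,z\n"

# chunksize keeps at most that many rows in memory at once, and
# usecols discards the unwanted columns at parse time.
reader = pd.read_csv(StringIO(csv_text), usecols=["id", "wall"], chunksize=2)
for i, chunk in enumerate(reader):
    # With a real file, append each slimmed chunk to a new CSV:
    # chunk.to_csv("slim.csv", mode="a", header=(i == 0), index=False)
    print(chunk.columns.tolist(), len(chunk))
```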
Sayali Sonawane

I am not sure if this is possible in pandas. You can try to do it from the command line. On Linux it will look like:

cut -d, -f1,2,5- inputfile

if you want to delete the columns at positions 3 and 4 (cut counts fields from 1; -d, sets the comma delimiter, since cut splits on tabs by default).
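One caveat: cut splits on raw delimiters, so it mishandles quoted fields that contain commas. If that matters, Python's csv module can do the same streaming column drop correctly (a sketch on made-up data; zero-based indexes 2 and 3 correspond to the 3rd and 4th columns above):

```python
import csv
from io import StringIO

# Made-up sample input; note the quoted field containing a comma,
# which cut would split incorrectly.
src = StringIO('a,b,"c,3",d,e\n1,2,3,4,5\n')
dst = StringIO()

drop = {2, 3}  # zero-based indexes of the columns to delete
writer = csv.writer(dst)
for row in csv.reader(src):
    # Keep every field whose position is not in `drop`.
    writer.writerow(v for i, v in enumerate(row) if i not in drop)

print(dst.getvalue())
```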

vlad.rad