Delete column(s) from very large CSV file using pandas or blaze

Question

I have a very large csv file (5 GB), so I do not want to load the whole thing into memory, and I want to delete one or more of its columns. I tried using the following code in blaze, but all it did was append the resulting columns to the existing csv file:

from blaze import Data, odo
d = Data("myfile.csv")
d = d[columns_I_want_to_keep]
odo(d, "myfile.csv")

Is there a way using either pandas or blaze to only keep the columns I want and delete the other ones?

does `odo` support file modes? if so try `odo(d, "myfile.csv", mode='w+')` to overwrite the file — EdChum, Jul 01 '16 at 15:42
It doesn't seem like it @EdChum; I tried both mode = "w" and mode = "w+" to no avail.... — Alex, Jul 01 '16 at 16:23

root · Accepted Answer · 2016-07-01T17:09:33.420

You can use dask.dataframe, which is syntactically similar to pandas, but does manipulations out-of-core so memory shouldn't be an issue. It also parallelizes the process automatically, so it should be fast.

import dask.dataframe as dd

df = dd.read_csv('myfile.csv', usecols=['col1', 'col2', 'col3'])
df.to_csv('output.csv', index=False)

Timings

I've timed each method posted so far on a 1.4 GB csv file. I kept four columns, leaving the output csv file at 250 MB.

Using Dask:

%%timeit
df = dd.read_csv(f_in, usecols=cols_to_keep)
df.to_csv(f_out, index=False)

1 loop, best of 3: 41.8 s per loop

Using Pandas:

%%timeit
chunksize = 10**5
for chunk in pd.read_csv(f_in, chunksize=chunksize, usecols=cols_to_keep):
    chunk.to_csv(f_out, mode='a', index=False)

1 loop, best of 3: 44.2 s per loop

Using Python/CSV:

%%timeit
inc_f = open(f_in, 'r')
csv_r = csv.reader(inc_f)
out_f = open(f_out, 'w')
csv_w = csv.writer(out_f, delimiter=',', lineterminator='\n')
for row in csv_r:
    new_row = [row[1], row[5], row[6], row[8]]
    csv_w.writerow(new_row)
inc_f.close()
out_f.close()

1 loop, best of 3:  1min 1s per loop

I'm surprised `df.to_csv('output.csv', index=False)` looks to work, when I tried it as per [dask documentation](https://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.Series.to_csv) it will create multiple files and you can't specify a single file name but instead it will chunk it into multiple files ("One filename per partition will be created. You can specify the filenames in a variety of ways."). — citynorman, Oct 02 '17 at 19:22

score 2 · Answer 2 · answered Jul 01 '16 at 15:46

2

i would do it this way:

cols2keep = ['col1','col3','col4','col6'] # columns you want to have in the resulting CSV file
chunksize = 10**5  # you may want to adjust it ... 
for chunk in pd.read_csv(filename, chunksize=chunksize, usecols=cols2keep):
    chunk.to_csv('output.csv', mode='a', index=False)

PS you may also want to consider migrating from CSV to PyTables (HDF5) if it's appropriate for you...

answered Jul 01 '16 at 15:46

MaxU - stand with Ukraine

205,989
36
386
419

but what if you had 100s of columns? You'd have to write them all out here. – Theo F Aug 13 '20 at 14:41
@TheoF, well you can easily generate it dynamically – MaxU - stand with Ukraine Aug 13 '20 at 14:46

score 1 · Answer 3 · answered Jul 01 '16 at 16:10

I deal with Large csv files a lot. Here is my solution:

import csv
fname_in = r'C:\mydir\myfile_in.csv' 
fname_out = r'C:\mydir\myfile_out.csv' 
inc_f = open(fname_in,'r')  #open the file for reading
csv_r = csv.reader(inc_f) # Attach the csv "lens" to the input stream - default is excel dialect
out_f = open(fname_out,'w') #open the file for writing
csv_w = csv.writer(out_f, delimiter=',',lineterminator='\n' ) #attach the csv "lens" to the stream headed to the output file
for row in csv_r: #Loop Through each row in the input file
    new_row = row[:]  # initialize the output row
    new_row.pop(5) #Whatever column you wanted to delete
    csv_w.writerow(new_row) 
inc_f.close()
out_f.close()

score 1 · Answer 4 · answered Dec 10 '19 at 14:01

Reading the original CSV by chunk and appending to a new file will print the header each time you save the new chunks to the disk. It can be avoided the following way:

cols_to_keep = ['col1', 'col2'] # or [0, 1]
add_header = True
chunksize = 10**5
for chunk in pd.read_csv(f_in, chunksize=chunksize, usecols=cols_to_keep):
    chunk.to_csv(f_out, mode='a', index=False, header=add_header)
    if add_header:
        # The header should not be printed more than one
        add_header = False

Delete column(s) from very large CSV file using pandas or blaze

4 Answers4