
I have a 12 GB CSV file. I'd like to extract only some of its columns and write them to a new CSV that I can then load into R for analysis.

The problem is that I'm getting a memory error when trying to load the entire file at once before writing the new CSV file. How can I parse the data row by row and then create a CSV output?

Here is what I have so far:

import pandas

colnames = ['contributor name', 'recipient name', 'recipient party', 'contributor cfscore', 'candidate cfscore', 'amount']

DATA = pandas.read_csv('pathname\filename.csv', names=colnames)
DATA.to_csv('pathname\filename.csv', cols = colnames)
– ModalBro
    Have you looked at csvkit? I have no idea how it is in terms of efficiency or speed. http://csvkit.readthedocs.org/en/0.7.3/ – A5C1D2H2I1M1N2O1R2T1 May 25 '14 at 17:53
  • Where do you get that file btw? – dawg May 25 '14 at 18:11
  • Try the [Google CRUSH Tools](https://code.google.com/p/crush-tools/). That toolset has a wide array of data manipulation routines that should let you slice, dice and even do some preliminary aggregation to get the data into a more manageable size. Also, If you can get the data into a simple database, `dplyr` could be an excellent choice for crunching the data since it most likely won't eat up all your memory. – hrbrmstr May 25 '14 at 19:24

7 Answers


In R, you can use the fread function from the popular data.table package.

You can use the drop= argument to specify columns that should not be read: no memory is allocated for them, and they are not read at all. Or use select= to name just the columns you want to keep, if that is more convenient. fread can read csv files very, very quickly.

If you're dealing with this much data, you'll probably want to familiarize yourself with the data.table package anyway.
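
For instance, a minimal sketch of the select= route (the column names are taken from the question; the file paths are placeholders):

library(data.table)

# Read only the columns of interest; the remaining columns are skipped
# entirely, so no memory is allocated for them.
keep <- c("contributor name", "recipient name", "recipient party",
          "contributor cfscore", "candidate cfscore", "amount")
DT <- fread("filename.csv", select = keep)

# Write the reduced table back out as a smaller csv for later use.
write.csv(DT, "filename_subset.csv", row.names = FALSE)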


Alternatively, ?read.csv.sql from the sqldf package says it will

Read a file into R filtering it with an sql statement. Only the filtered portion is processed by R so that files larger than R can otherwise handle can be accommodated.

Here's the example:

library(sqldf)
write.csv(iris, "iris.csv", quote = FALSE, row.names = FALSE)
iris2 <- read.csv.sql("iris.csv",
                      sql = "select * from file where Species = 'setosa' ")
– GSee
  • Thanks GSee. I tried using fread as you described before posting this, but after about 25 mins or so it would give an error regarding the separators... – ModalBro May 25 '14 at 20:08
  • @user3342483 maybe [this bug](http://stackoverflow.com/questions/23833294/data-tablefread-doesnt-like-missing-values-in-first-column) is biting. :( It would be great if you could run it again with `verbose=TRUE` and send the output to the datatable-help list. Or if the file is available online and you could provide a link, it would help to debug. – GSee May 25 '14 at 20:16
  • The file is available at this link, (be warned, it's 2 GB): https://dl.dropboxusercontent.com/u/2047562/contribDB_2012.zip The error R gives me is this: Error in fread("~/BonicaIdeology/contribDB_2012.csv", verbose = T, select = c("contributor name", : Expected sep (',') but ' ' ends field 20 on line 35265 when reading data: 2012,"e:ind:2012:32748445","15",75,"2011-09-30",4185397934,"arbitter, daniel s s","arbitter","daniel s","s","","","daniel s s","I","M","14871 lakewood dr","plymouth","MI",481702694,"manager, core veh dyn – ModalBro May 25 '14 at 20:57
  • @user3342483 Thanks for the file. I downloaded and inspected it. That line (35,265) has a `\n` in field 20. To look around that line I used: `head -n 35267 contribDB_2012.csv | tail -n 4`. `fread` doesn't cope with embedded `\n` yet. I'll take a look ... – Matt Dowle May 26 '14 at 13:43
  • @user3342483 The embedded \n problem is [now fixed](http://stackoverflow.com/a/21009264/403310). I've tested on your 12GB file and I'm able to `select=` columns from it ok. Several fields had series of backslashes which is also fixed and tests added. [Latest v1.9.3](https://github.com/Rdatatable/data.table/) should work now. – Matt Dowle Jun 24 '14 at 03:12

You could use the csv module to process the file line-by-line. Something like this might work:

import csv
infname = "csv.csv"
outfname = "csv_stripped.csv"
cols = ["col1", "col2", "col3"]
with open(infname) as inf, open(outfname, 'w+') as outf:
    reader = csv.DictReader(inf)
    writer = csv.DictWriter(outf, cols, extrasaction='ignore')
    writer.writeheader()
    for line in reader:
        writer.writerow(line)

For reference: see the csv.DictReader and csv.DictWriter documentation in the csv module docs: https://docs.python.org/3/library/csv.html

– moooeeeep
  • @user3342483 Probably you've got an error somewhere (maybe column names or delimiters?). I just made up a simple test, it works for me. – moooeeeep May 25 '14 at 20:16
  • Sorry, I deleted my last comment by accident. I think the spaces in my col names are throwing things off.... – ModalBro May 25 '14 at 20:41

You don't need Python or any extra R packages to do this. You can use the colClasses argument when reading the data into R to read in only certain columns. Give a column a colClasses value of "NULL" to prevent it from being loaded during input.

Example for a file ("myfile.csv") with 3 columns, where the first should be treated as character, the third as integer, and the second should be left out:

read.csv("myfile.csv", colClasses=c("character", "NULL", "integer"))

This method requires you to specify classes for every column in your file though.

It might also be worth you reading the advice for memory usage at http://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html if you have many rows.

– ping
  • Technically this may of course work, but the performance of read.csv on a 12GB file will certainly be an issue. – Rahul Premraj May 28 '14 at 05:55
  • True, this would be faster than reading the entire file with no "NULL"s, but still quite slow. – ping May 28 '14 at 15:25

Reading all data into memory first is not a good idea.

There are two ways to deal with such large data files.

  1. Stream each line, pick out the relevant information, and discard the rest until EOF, the way a streaming XML parser does (see the sketch after this list).

  2. Stream each line and write all of the data into a much better format that lets you easily access the relevant information later, e.g. PyTables.
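
A rough sketch of the first approach using pandas, since the question already uses it (the column names are taken from the question; the file names and chunk size are placeholders): read the file in fixed-size chunks, keep only the wanted columns, and append each reduced chunk to the output file.

import pandas as pd

# Column names from the question; paths and chunk size are placeholders.
cols = ['contributor name', 'recipient name', 'recipient party',
        'contributor cfscore', 'candidate cfscore', 'amount']

first = True
for chunk in pd.read_csv('filename.csv', usecols=cols, chunksize=100000):
    # Write the header only with the first chunk, then append.
    chunk.to_csv('filename_subset.csv', mode='w' if first else 'a',
                 header=first, index=False)
    first = False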

– Bort

If you have a function filter_row which returns True when you want to keep the row, you can use:

with open('input.csv') as inp:
    with open('output.csv', 'w') as outp:
        for line in inp:
            if filter_row(line):
                outp.write(line)
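
The answer assumes filter_row already exists; a hypothetical example that keeps only rows whose third comma-separated field equals 'REP' might look like the following (a naive split like this breaks on quoted fields containing commas, so the csv module is safer for real data):

def filter_row(line):
    # Hypothetical predicate: keep rows whose third field is "REP".
    # Adjust the index and the test to match the real data.
    fields = line.split(',')
    return len(fields) > 2 and fields[2].strip() == 'REP'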
– Daniel

You could achieve this using R and JDBC. Let's create a sample csv file.

write.table(x=mtcars, file="mtcars.csv", sep=",", row.names=F, col.names=T) # create example csv file

Download and save the CSV JDBC driver from this link: http://sourceforge.net/projects/csvjdbc/files/latest/download and then set up the driver.

> library(RJDBC)

> path.to.jdbc.driver <- "jdbc//csvjdbc-1.0-18.jar"
> drv <- JDBC("org.relique.jdbc.csv.CsvDriver", path.to.jdbc.driver)
> conn <- dbConnect(drv, sprintf("jdbc:relique:csv:%s", getwd())) # replace getwd() with location of csv file

Now you can compose a query to select the columns you need and, if needed, add a where clause to filter the data, in this case selecting only rows where gear takes the value 5:

> dbGetQuery(conn, "select mpg, cyl, gear from mtcars where gear = '5'")
   mpg cyl gear
1   26   4    5
2 30.4   4    5
3 15.8   8    5
4 19.7   6    5
5   15   8    5
– Rahul Premraj

Try this:

    with open("csv.csv", "r") as f:
        for line in f:
            print(line)

It won't load the whole file into memory; it reads the file line by line.

– RydallCooper