I have looked at other answers but I am still not sure about the right way to do this. I have a number of really large .csv files (each can be a gigabyte or more), and I first want to get their column labels, because they are not all the same, and then, according to user preference, extract some of those columns based on some criteria. Before starting the extraction part I ran a simple test to see which is the fastest way to parse these files, and here is my code:
import csv
import mmap
import time

def mmapUsage():
    start = time.time()
    with open("csvSample.csv", "r+b") as f:
        # memory-map the file; size 0 means the whole file
        mapInput = mmap.mmap(f.fileno(), 0)
        # read content via standard file methods
        L = list()
        for s in iter(mapInput.readline, ""):
            L.append(s)
        print "List length: ", len(L)
        #print "Sample element: ", L[1]
        mapInput.close()
    end = time.time()
    print "Time for completion", end - start
def fileopenUsage():
    start = time.time()
    fileInput = open("csvSample.csv")
    M = list()
    # plain iteration over the file object, one line per element
    for s in fileInput:
        M.append(s)
    print "List length: ", len(M)
    #print "Sample element: ", M[1]
    fileInput.close()
    end = time.time()
    print "Time for completion", end - start
def readAsCsv():
    X = list()
    start = time.time()
    # csv.reader parses every line into a list of fields
    spamReader = csv.reader(open('csvSample.csv', 'rb'))
    for row in spamReader:
        X.append(row)
    print "List length: ", len(X)
    #print "Sample element: ", X[1]
    end = time.time()
    print "Time for completion", end - start
And my results:
=======================
Populating list from Mmap
List length: 1181220
Time for completion 0.592000007629
=======================
Populating list from Fileopen
List length: 1181220
Time for completion 0.833999872208
=======================
Populating list by csv library
List length: 1181220
Time for completion 5.06700015068
So it seems that the csv library most people use is really a lot slower than the others. Maybe it will prove faster later, once I start actually extracting data from the csv files, but I cannot be sure of that yet. Any suggestions and tips before I start implementing? Thanks a lot!
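For reference, the kind of extraction I have in mind looks roughly like this (the column names and the filter criterion are only placeholders, not my real data):

def extractColumns(filename, wantedColumns, criterion):
    # read the header row first to learn the column labels,
    # then keep only the requested columns for rows matching the criterion
    with open(filename, 'rb') as f:
        reader = csv.reader(f)
        header = reader.next()
        indices = [header.index(c) for c in wantedColumns]
        rows = []
        for row in reader:
            if criterion(row):
                rows.append([row[i] for i in indices])
    return header, rows

# hypothetical usage: keep two columns for rows where the third field is non-empty
# header, rows = extractColumns("csvSample.csv", ["colA", "colB"], lambda r: r[2] != "")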