
I have a CSV file with over a million rows, and I am trying to parse this file and insert the rows into the DB.

    import csv

    with open(file, "rb") as csvfile:
        re = csv.DictReader(csvfile)
        for row in re:
            # insert row['column_name'] into DB

For csv files below 2 MB this works well, but anything larger ends up eating my memory. It is probably because I store the DictReader's contents in a list called "re" and it is not able to loop over such a huge list. I definitely need to access the csv file by its column names, which is why I chose DictReader, since it easily provides column-level access to my csv files. Can anyone tell me why this is happening and how it can be avoided?

Tania
  • http://stackoverflow.com/questions/24868856/reading-rows-of-big-csv-file-in-python – DhruvPathak Apr 23 '15 at 06:30
  • Although not answering your actual question, if you need to load the data as is, it could be easier and faster to use the DB's own facilities (for example, `COPY table(col1, col2) FROM file WITH CSV` in Postgres or `LOAD DATA INFILE` in MySQL). – Vadim Landa Apr 23 '15 at 06:31
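
For reference, a minimal sketch of the bulk-load route mentioned in the last comment, assuming Postgres with psycopg2; the connection string, table name and column names below are placeholders, not taken from the question:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=me")     # placeholder connection parameters
    cur = conn.cursor()

    with open(filename, "r") as csvfile:
        # stream the file straight into the table; HEADER skips the column-name row
        cur.copy_expert(
            "COPY my_table (col1, col2) FROM STDIN WITH CSV HEADER",
            csvfile)

    conn.commit()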

2 Answers


The DictReader does not load the whole file into memory but reads it in chunks, as explained in the answer suggested by DhruvPathak.

But depending on your database engine, the actual write to disk may only happen at commit time. That means that the database (and not the csv reader) keeps all the data in memory and eventually exhausts it.

So you should try to commit every n records, with n typically between 10 and 1000 depending on the size of your rows and the available memory.
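
A minimal sketch of that approach, assuming `file` and an existing DB-API connection `conn` from your own code; the `%s` placeholder style matches drivers like psycopg2 or MySQLdb, and the table and column names are placeholders:

    import csv

    BATCH_SIZE = 1000                  # commit every n rows; tune to row size and memory

    with open(file, "rb") as csvfile:
        reader = csv.DictReader(csvfile)
        cur = conn.cursor()            # conn is your existing DB connection (assumed)
        for i, row in enumerate(reader, 1):
            cur.execute("INSERT INTO my_table (column_name) VALUES (%s)",
                        (row['column_name'],))
            if i % BATCH_SIZE == 0:
                conn.commit()          # flush this batch instead of keeping everything pending
        conn.commit()                  # commit whatever is left after the last full batch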

Serge Ballesta

If you don't need all the columns at once, you can simply read the file line by line, as you would a plain text file, and parse each row yourself. The exact parsing depends on your data format, but you could do something like:

    delimiter = ','
    with open(filename, 'r') as fil:
        # the first line holds the column names
        headers = fil.next()
        headers = headers.strip().split(delimiter)
        # map each column name to its position in a row
        dic_headers = {hdr: headers.index(hdr) for hdr in headers}
        for line in fil:
            row = line.strip().split(delimiter)
            # do something with row[dic_headers['column_name']]

This is a very simple example, but it can be made more elaborate. Note, for example, that it does not work if your data contains `,` inside a field.
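
If your data does contain quoted commas, the `csv` module parses them correctly while still reading one row at a time, so the file never has to be held in memory; a rough sketch along the lines of the question's code, where `'column_name'` is just a placeholder:

    import csv

    with open(filename, "rb") as csvfile:    # "rb" for the Python 2 csv module
        reader = csv.reader(csvfile)
        headers = next(reader)               # first row holds the column names
        col = headers.index('column_name')   # placeholder column name
        for row in reader:
            # the csv parser handles quoted commas inside a field
            value = row[col]
            # do something with value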

Julien Spronck
  • Can you please tell me which line in my previous code takes all the cols at once? – Tania Apr 23 '15 at 06:43
  • AFAIK the csv readers internally already iterate over the lines and do not load the whole file into memory, so I highly doubt that this really solves OP's problem – Serge Ballesta Apr 23 '15 at 06:43
  • I just saw that and upvoted your answer. We learn every day. – Julien Spronck Apr 23 '15 at 06:44
  • `headers = headers.strip().split(delimiter)` results in an error: built-in method has no attribute 'split' – Tania Apr 23 '15 at 06:44
  • @Tania As suggested by DhruvPathak and Serge Ballesta, this is most likely not solving your memory error as the `DictReader` does not put the entire file into memory – Julien Spronck Apr 23 '15 at 06:47
  • @Tania if `headers` is a string (which it should be), `headers = headers.strip().split(delimiter)` should not result in that error. – Julien Spronck Apr 23 '15 at 06:49