
Question 1.

I tried reading a ~1GB CSV file like below:

import csv

res = []
with open("my_csv.csv", "r") as f:
    reader = csv.reader(f)
    for row in reader:
        res.append(row)

I thought 1GB was small enough to load into memory as a list, but in fact the code froze and memory usage hit 100%. I had checked that a few extra GB of memory were free before I ran the code.

This answer says,

"You are reading all rows into a list, then processing that list. Don't do that."

But I wonder WHY? Why does the list take up so much more memory than the file size?


Question 2.

Is there any method to parse a CSV into a dict without a memory issue?

For example,

CSV

apple,1,2,a    
apple,4,5,b    
banana,AAA,0,3    
kiwi,g1,g2,g3

Dict

{"apple" : [[1, 2, a], [4, 5, b]],
 "banana": [[AAA, 0, 3]],
 "kiwi"  : [[g1, g2, g3]]}
sssbbbaaa
  • Are you running a 32-bit Python? How much RAM do you have? – Tim Roberts Jan 07 '22 at 01:04
  • The fact that this information comes from a CSV is irrelevant. You're just asking how much memory it takes to hold millions of lists of strings. – Barmar Jan 07 '22 at 01:08
  • Right. Each row consists of a list object and 4 string objects plus the string data. That does take more memory, especially if the fields are small, like your example. – Tim Roberts Jan 07 '22 at 01:14
  • @TimRoberts 64-bit and 16GB. So the extra is at least 10GB. – sssbbbaaa Jan 07 '22 at 01:25
  • Please explain the processing you want to do on the data. It might be possible to perform this a row at a time whilst reading it in. e.g. if you are just counting things. This way the file could be any size without causing a memory issue. – Martin Evans Jan 07 '22 at 10:58

2 Answers


Appending millions of elements to a list in a loop like that can be inefficient, because periodically the list grows beyond its current allocation and has to be copied to a new area of memory to increase its size. This happens over and over as the list gets larger, so it becomes an exponential process.
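
To see the reallocation in action, here's a rough sketch (exact sizes depend on your Python build) that watches sys.getsizeof as a list grows; the allocated size jumps in steps rather than on every append:

import sys

# CPython over-allocates so that most appends don't trigger a copy.
# Print only when the allocated size changes, i.e. when a reallocation happened.
lst = []
last_size = sys.getsizeof(lst)
for i in range(64):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last_size:
        print(f"len={len(lst):3d}  allocated bytes={size}")
        last_size = size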

You might be better off using the list() function, which may be able to do it more efficiently.

import csv

with open("my_csv.csv", "r") as f:
    reader = csv.reader(f)
    res = list(reader)

Even if it still has the same memory issues, it will be faster simply because the loop is in optimized C code rather than interpreted Python.

There's also overhead from all the lists themselves. Internally, a list has some header information, and then pointers to the data for each list element. There can also be excess space allocated to allow for growth without reallocating, but I suspect the csv module is able to avoid this (it's uncommon to append to lists read from a CSV). This overhead is usually not significant, but if you have many lists and the elements are small, the overhead can come close to doubling the memory required.
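
As a rough illustration of that per-row overhead (numbers vary by Python version and platform), compare one line of the file with the parsed row the csv module would give you:

import sys

line = "apple,1,2,a\n"            # roughly what one row costs on disk
row = ["apple", "1", "2", "a"]    # what csv.reader hands back for it

raw_bytes = len(line.encode("utf-8"))
# Size of the list object itself plus each string object it points to
# (the outer res list also holds an 8-byte pointer per row on 64-bit builds).
parsed_bytes = sys.getsizeof(row) + sum(sys.getsizeof(s) for s in row)

print(raw_bytes, parsed_bytes)    # ~12 bytes on disk vs. a few hundred in memory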

For your second question, you should heed the advice in the answer you linked to. Process the file one record at a time, adding to the dictionary as you go.

result = {}
with open("my_csv.csv", "r") as f:
    reader = csv.reader(f)
    for row in reader:
        result.setdefault(row[0], []).append(row[1:])
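
If you prefer, the same grouping can be written with collections.defaultdict; this is just a sketch of the same row-at-a-time approach and doesn't change the memory behaviour (the dict still ends up holding every row):

import csv
from collections import defaultdict

result = defaultdict(list)
with open("my_csv.csv", "r", newline="") as f:
    for row in csv.reader(f):
        # Group the remaining fields under the first column's value.
        result[row[0]].append(row[1:])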
Barmar
  • "list grows beyond its current allocation and has to be copied to a new area of memory increase its size" Does this mean when I append an element to a list, a totally new list is created in memory space? Then I conjecture old lists(before appending) are not removed from the memory immediately. (e.g., if I append a range from 0 to 5 then a memory keeps all the lists [], [0], [0, 1], ..., and [0, 1, 2, 3, 4, 5]) Do I understand correctly? – sssbbbaaa Jan 07 '22 at 01:33
  • It doesn't do it every time. When it reallocates, it adds extra space to allow for growth. So if you have a list of 10 elements, and append to it, it might allocate space for 20 elements. Then you can add 9 more elements before it needs to reallocate and copy. – Barmar Jan 07 '22 at 01:37
  • The garbage collector immediately reclaims the old memory. – Barmar Jan 07 '22 at 01:38
  • Ah okay. I was confused by what you said about an 'exponential process'. Roughly, it allocates about twice as much space. I should check how a list uses memory. Thanks for the detailed explanation. – sssbbbaaa Jan 07 '22 at 01:44
  • It's still exponential, because it then has to grow again when it reaches 20, and then 40, and so on. – Barmar Jan 07 '22 at 01:45

To answer your second question:

Is there any method to parse a CSV into a dict without a memory issue?

You're not saying what a "memory issue" is, but if you're parsing a CSV into a dict in Python, you're going to use more memory than the CSV itself.

I created a script to generate "big" CSVs and then monitored time and peak memory consumption using @Barmar's code to build the result dict and noticed that on average that code used 10X more memory than the size of the CSV.
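
My generator script isn't shown here, but a minimal sketch of the kind of file it produced (10 columns of short fields, file names like gen_1000000x10.csv; the exact field contents are an assumption) could look like this:

import csv
import sys

# Sketch: write an N-row, 10-column CSV of short string fields.
# Assumed usage: python3 gen_csv.py 1000000
n_rows = int(sys.argv[1])
with open(f"gen_{n_rows}x10.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for i in range(n_rows):
        writer.writerow([f"key{i % 1000}"] + [f"v{i}_{c}" for c in range(9)])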

Below are my results from processing 3 of those "big" files: one with 100K rows, one with 1M rows, and one with 10M rows.

The stats for the csv-to-dict process of each file are shown in the 3 blocks below:

  • The first line is from ls -h <CSV-FILE>
  • The next two lines are from /usr/bin/time -l <CSV-FILE>
715M Jan  6 19:44 gen_10000000x10.csv
55.98 real        49.54 user         4.33 sys
7.46G  peak memory footprint
---
72M  Jan  6 19:47 gen_1000000x10.csv
4.66 real         4.49 user         0.15 sys
753M  peak memory footprint
---
7.2M Jan  6 19:44 gen_100000x10.csv
0.35 real         0.32 user         0.02 sys
79M  peak memory footprint
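
If you want to measure just the Python-side allocations (rather than the whole-process footprint that /usr/bin/time reports), tracemalloc gives a comparable peak figure; here's a sketch reusing the dict-building code from above (note that tracemalloc itself adds overhead and slows the run down):

import csv
import tracemalloc

tracemalloc.start()

result = {}
with open("gen_1000000x10.csv", "r", newline="") as f:
    for row in csv.reader(f):
        result.setdefault(row[0], []).append(row[1:])

# get_traced_memory() returns (current, peak) in bytes.
current, peak = tracemalloc.get_traced_memory()
print(f"peak allocations: ~{peak / 2**20:.0f} MiB")
tracemalloc.stop()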
Zach Young