
I am writing a very simple script that counts the number of occurrences of each value pair in a file. The file is about 300 MB (15 million lines) and has 3 columns. Since I am reading the file line by line, I don't expect Python to use much memory. At most it should be slightly above 300 MB, to store the count dictionary.

However, when I look at Activity Monitor, the memory usage goes above 1.5 GB. What am I doing wrong? If this is normal, could someone please explain? Thanks.

import csv
def get_counts(filepath):
    with open(filepath,'rb') as csvfile:
        reader = csv.DictReader(csvfile, fieldnames=['col1','col2','col3'], delimiter=',')
        counts = {}
        for row in reader:

            key1 = int(row['col1'])
            key2 = int(row['col2'])

            if (key1, key2) in counts:
                counts[key1, key2] += 1
            else:
                counts[key1, key2] = 1

    return counts
Romain
  • Did you try the method `csv.reader` : `data = csv.reader(open(csvfile), delimiter=',')` and then `for row in data:` ? [source](http://lethain.com/handling-very-large-csv-and-xml-files-in-python/) – Till Apr 08 '16 at 12:11
  • You could also do `count[key1, key2] = count.get((key1, key2), 0) + 1` instead of the `if else` statement. – Till Apr 08 '16 at 12:15
  • Maybe it's just a paste issue, but the third line is not indented (and therefore the 4th too), which will cause your script to raise an error. – Mathieu B Apr 08 '16 at 12:22
  • Where does it say that you are "reading the file line by line"? You are reading rows one by one from the DictReader. –  Apr 08 '16 at 12:24
  • I think the dictionary has enough overhead to reach 1.5 GB. I tested a simple dictionary with 1000 keys, each with the value 1, and it took 24k. – Florin Ghita Apr 08 '16 at 12:56
  • Use [`objgraph`](https://pypi.python.org/pypi/objgraph) to determine what exactly uses memory, otherwise it's just guessing. Or [`memory_profiler`](https://pypi.python.org/pypi/memory_profiler) – warvariuc Apr 08 '16 at 14:16
  • consider using something like SQLite instead of dictionary to keep your memory footprint small. https://docs.python.org/2/library/sqlite3.html – SKG Apr 08 '16 at 14:53
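A minimal sketch of that last SQLite suggestion could look like the following; the database file name, table name, and schema here are only assumptions made for the example:

import csv
import sqlite3

def get_counts_sqlite(filepath, dbpath='counts.db'):
    # Keep the counts in an on-disk SQLite table so the Python process
    # does not have to hold millions of dictionary entries in memory.
    conn = sqlite3.connect(dbpath)
    conn.execute('CREATE TABLE IF NOT EXISTS counts '
                 '(key1 INTEGER, key2 INTEGER, n INTEGER, '
                 'PRIMARY KEY (key1, key2))')
    with open(filepath, 'rb') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        for row in reader:
            key1, key2 = int(row[0]), int(row[1])
            # Insert the pair with a zero count if it is new, then bump it.
            conn.execute('INSERT OR IGNORE INTO counts VALUES (?, ?, 0)',
                         (key1, key2))
            conn.execute('UPDATE counts SET n = n + 1 '
                         'WHERE key1 = ? AND key2 = ?', (key1, key2))
    conn.commit()
    return conn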

3 Answers


I think it's quite OK that Python uses so much memory in your case. Here is a test on my machine:

Python 2.7.10 (default, Oct 23 2015, 19:19:21)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
>>> file_size = 300000000
>>> column_count = 10
>>> average_string_size = 10
>>> row_count = file_size / (column_count * average_string_size)
>>> row_count
3000000
>>> import os, psutil, cPickle
>>> mem1 = psutil.Process(os.getpid()).memory_info().rss
>>> data = [{column_no: '*' * average_string_size for column_no in xrange(column_count)} for row_no in xrange(row_count)]
>>> mem2 = psutil.Process(os.getpid()).memory_info().rss
>>> mem2 - mem1
4604071936L
>>>

So the full list of 3,000,000 dicts, each with 10 items holding 10-character strings, uses more than 4 GB of RAM.

In your case I don't think it's the CSV data that takes the RAM; it's your counts dictionary.
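If you want to check that claim on your own machine, a minimal sketch (Python 2, assuming psutil is installed as in the session above, and taking the worst case where all 15 million key pairs are unique) would be to build a comparable dictionary and measure the RSS delta:

import os
import psutil

process = psutil.Process(os.getpid())
mem_before = process.memory_info().rss

# A dictionary comparable to the counts dict from the question:
# up to 15 million (int, int) tuple keys, each mapped to a small int.
counts = {(i, i + 1): 1 for i in xrange(15000000)}

mem_after = process.memory_info().rss
print 'counts dictionary added roughly %d MB' % ((mem_after - mem_before) / (1024 * 1024))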

Another explanation would be that the dicts read one by one from the CSV file are not immediately garbage collected (though I can't say that for sure).

In any case, use a specialized tool to see what is actually taking the memory, for example https://pypi.python.org/pypi/memory_profiler
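For example, a minimal way to apply memory_profiler to the function from the question (assuming the package is installed via pip install memory_profiler; the data file name below is just a placeholder) is:

import csv
from memory_profiler import profile

@profile
def get_counts(filepath):
    with open(filepath, 'rb') as csvfile:
        reader = csv.DictReader(csvfile, fieldnames=['col1', 'col2', 'col3'],
                                delimiter=',')
        counts = {}
        for row in reader:
            key = (int(row['col1']), int(row['col2']))
            counts[key] = counts.get(key, 0) + 1
    return counts

if __name__ == '__main__':
    # Running the script normally prints a line-by-line memory report
    # for the decorated function.
    get_counts('data.csv')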

P.S. Instead of doing

        if (key1, key2) in counts:
            counts[key1, key2] += 1
        else:
            counts[key1, key2] = 1

Do

from collections import defaultdict
...
counts = defaultdict(int)
...
counts[(key1, key2)] += 1
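Put together, the function from the question could look roughly like this with a defaultdict (a sketch of the same approach):

import csv
from collections import defaultdict

def get_counts(filepath):
    counts = defaultdict(int)
    with open(filepath, 'rb') as csvfile:
        reader = csv.DictReader(csvfile, fieldnames=['col1', 'col2', 'col3'],
                                delimiter=',')
        for row in reader:
            # Missing keys start at 0, so no membership test is needed.
            counts[int(row['col1']), int(row['col2'])] += 1
    return counts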
warvariuc

You could try something like this:

import csv

def get_counts(filepath):
    data = csv.reader(open(filepath), delimiter=',')
    # Skip the first line if it contains headers
    fields = data.next()
    counts = {}

    for row in data:
        # dict.get avoids the explicit if/else membership test
        counts[row[0], row[1]] = counts.get((row[0], row[1]), 0) + 1

    return counts
Till

Try this:

from collections import Counter
import csv

myreader = csv.reader(open(filename, 'r'))
# Count (col1, col2) pairs: row[:-1] drops the last column, and tuple()
# makes the pair hashable so it can be used as a Counter key.
counts = Counter(tuple(row[:-1]) for row in myreader)

Hope this helps.

sam