
I'm loading a csv file line by line because it has ~800 million lines, and there are many such files that I need to analyse. Loading in parallel is paramount, and reading line by line is also required so as not to blow up the memory.

I have already been shown how to count the number of occurrences of each unique ID throughout the dataset using collections.Counter() (see Counting csv column occurrences on the fly in Python).
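For reference, a minimal sketch of that counting approach (assuming, as in the example below, that the IDs live in the second column and that filename is already defined):

from collections import Counter
import csv

id_counts = Counter()
with open(filename) as f:       # filename assumed to be defined elsewhere
    for row in csv.reader(f):
        id_counts[row[1]] += 1  # count occurrences of each unique ID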

But is there a way to keep a running total of the values in one column, grouped by the unique IDs in another column, as each line is read in?

e.g. suppose the data in your csv file had only two columns, so it looked like the following:

    1,1
    1,1
    2,2
    3,2
    2,2
    1,2

Here the second column contains the unique IDs, and you want a running total of the first-column values for each ID. So your output should look like the following:

{'1': 2, '2': 8}

For ID '1' in column two, the total is given by 1+1 from column one, and for ID '2' in column two, the total is given by 2+3+2+1.

How can one do this quickly, given the vast size of the csv files I'm working with? My current attempt looks like this:

import csv

features = {}

with open(filename) as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        ID = row[1]
        # initialise the running total the first time this ID is seen
        if ID not in features:
            features[ID] = {'Some_feature': 0}
        features[ID]['Some_feature'] += float(row[0])

But doing this over so many lines takes far too long. The idea is to also build a similar dictionary holding the number of occurrences per ID, so that I can divide the totals dictionary by that dictionary to get a mean of the features. (This may seem roundabout, but remember that these files are so large they must be read line by line.)
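A minimal sketch of that idea, using collections.defaultdict as suggested in the comments, tracking totals and counts in a single pass (filename is again assumed to be defined):

from collections import defaultdict
import csv

totals = defaultdict(float)  # running total of column one per ID
counts = defaultdict(int)    # number of occurrences per ID

with open(filename) as f:
    for row in csv.reader(f):
        ID = row[1]
        totals[ID] += float(row[0])
        counts[ID] += 1

# mean of the feature per ID, computed after the single pass
means = {ID: totals[ID] / counts[ID] for ID in totals}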

Using sqlite3 has been mentioned in the linked question but I would be interested in seeing if this can be done efficiently in Python first. Thanks in advance!
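(For reference, a rough sketch of what the sqlite3 route suggested in the comments could look like; the database filename, table name, and column names here are hypothetical:)

import csv
import sqlite3

conn = sqlite3.connect('features.db')  # hypothetical database file
conn.execute('CREATE TABLE IF NOT EXISTS rows (value REAL, id TEXT)')

with open(filename) as f:              # filename assumed to be defined
    conn.executemany('INSERT INTO rows VALUES (?, ?)', csv.reader(f))
conn.commit()

# running totals per unique ID, computed by the database
for ID, total in conn.execute('SELECT id, SUM(value) FROM rows GROUP BY id'):
    print(ID, total)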

user8188120
  • This is too broad. First you talk about files, then you show a list that cannot possibly be a csv file, then we are back to code reading files. What is the specific problem? Do you need an explanation of how to read csv files, or of how to group lists like the one shown by the criteria you gave? – timgeb Nov 29 '18 at 16:56
  • See, now I can see why you maybe had an error. Why the nested dictionary when counting just one column? Otherwise, use `features = defaultdict(lambda: {'Some_feature': 0})` to create a new dictionary with `Some_feature` pre-populated for new `ID` values. – Martijn Pieters Nov 29 '18 at 16:57
  • I read "I'm loading in a csv file [that] has ~800 million lines" and "there are many of these files" and I about had an aneurysm. Databases are cool, man! Use 'em! – Adam Smith Nov 29 '18 at 16:57
  • @Martijn Pieters♦ The nested dictionary is because I will be calculating ongoing totals for more than one feature per ID; I just tried to simplify it in this question to make what I'm trying to do more understandable. – user8188120 Nov 29 '18 at 17:02
  • Agree w/ @AdamSmith. Your IO is going to be your bottleneck, so it doesn't matter how clever you are in Python - it's going to be slow. You need a db. – keithpjolley Nov 29 '18 at 17:04
  • @keithpjolley right. Either load it all into memory (eek!) or use a DB -- there really isn't a performant middle ground. – Adam Smith Nov 29 '18 at 17:06
  • @AdamSmith: I assume by a database you mean using something like sqlite3? – user8188120 Nov 29 '18 at 17:08
  • @Martijn Pieters♦ I'm still confused by the `features = defaultdict(lambda: {'Some_feature': 0})` line, though. I think I'm misunderstanding it: if I just run that line, I get an error that `defaultdict` is not defined. – user8188120 Nov 29 '18 at 17:09
  • @user8188120 you need to import `defaultdict` from `collections`. Yes, I mean using something like sqlite3 (though there are many many many options) – Adam Smith Nov 29 '18 at 17:10
  • @user8188120: sorry, I assumed you'd have figured out how to import that type. `from collections import defaultdict`. Do check the documentation of things people recommend! – Martijn Pieters Nov 29 '18 at 17:10
  • And you *really* want to go with the database idea here now. This went from 200 million to 800 million rows and multiple data points. Had you loaded this into even sqlite you'd have had your answers already. `SELECT id, sum(some_column) FROM tablename GROUP BY id`. – Martijn Pieters Nov 29 '18 at 17:11
  • Ah, apologies, I meant 200 million in this thread as well... it's been a long day! True, but I wanted to see how far Python could be pushed in these extreme data size circumstances, plus my SQL is not the best. But thanks anyway, everyone; I'll try using a database now. – user8188120 Nov 29 '18 at 17:14

0 Answers