Math operation on very large CSV (roughly 25-30K lines) in Python3

Question

I got some help in a previous question about this, however this is quite a different issue so I thought a new question would be best...

Once a month I need to parse through a very large CSV file, which I would normally do manually in Excel, however I'm now wanting to do it automatically in Python.

The CSV is structured like so:

[IDNUMBER],[DATE1],[DATE2],[DATE3],[STRING-OF-WHAT-HAPPENED],[DATE3 - DATE2 IN DAYS],[ORIGINAL-FILENAME]

What I need is essentially a printed out display of the following (or to a file, but I don't need to hold on to this data, I just plug it into some charts):

For each original-filename (which can be up to 1200 rows), I need an average of [DATE3 - DATE2 IN DAYS]. For example:

12345,2011-06-12,2011-07-01,2011-07-2,1,['1100.csv']
54321,2011-06-12,2011-07-01,2011-07-3,2,['1100.csv']
23452,2011-06-12,2011-07-01,2011-07-4,3,['1100.csv']

The average would be 2, and I'd need to know that number, and it would be helpful to know how many per file as well, which in this example would be 3.

Then move on to the next original-filename (the last item in the row), until the end of the CSV.

In excel I would use autofilter and select each listing in that column and just select the [date3 - date2] column and just get an average, but it's kind of tedious and time consuming.

Thanks!

What's your question? The `csv` module would be useful for this. But I see you already know that from [your earlier question](http://stackoverflow.com/questions/7350851/). — Tom Zych, Sep 14 '11 at 13:00
Sorry. I'm not sure how to tell it to say "for all original_file1 do this, then for all original_file2 do the same thing, because they're not set variables I can manually code. Each month it might be different values. — mrmylanman, Sep 14 '11 at 13:21
The earlier question also explained what the `glob` module is for... — Tom Zych, Sep 14 '11 at 13:23
I was under the impression that glob was for pathname matching, not matching a pattern within a single file. All these values are in a single file (as the output of all those other files in the first question). Now I need to go through that single file and do the other operations listed above. If it's better to rework the program to output to many files, I can do that, but I assumed it'd be best to work with one file. I'm at a loss how to use glob in this context (all the data now being in one file). I am trying to learn how to use Python, and am clearly not very experienced in it. — mrmylanman, Sep 14 '11 at 13:28
Oh yeah, sorry, I did look at the data before but I forgot there were filenames in it. — Tom Zych, Sep 14 '11 at 13:29

score 0 · Answer 1 · answered Sep 14 '11 at 13:33

If I understand you correctly you can do it as follows (hopefully the code is self explanatory):

You can use 'csv' module in Python to read the file.

import csv

fileHandler = open('yourFile', 'rU')
csvReader = csv.reader(fileHandler)

#Get the first row to initialize variables
firstRow = csvReader.next()
columnToAverage = 5
originalFileColumn = 6
runningDaysSum = atoi(firstRow[columnToAverage])
totalRows = 1
originalFileName = firstRow[originalFileColumn]

result = {}

for row in csvReader:
    #Move to next row
    if originalFileName != row[originalFileColumn]:
        average = runningDaysSum/totalRows
        result[originalFileName] = (average, totalRows)
        originalFileName = row[originalFileColumn]
        totalRows = 0
        runningDaysSum = 0

    runningDaysSum += atoi(row[columnToAverage])
    totalRows += 1

#For last row
result[originalFileName] = (average, totalRows)

Hope it helps.

Note that this code assumes the filenames are sorted, which may not be the case. — Tom Zych, Sep 14 '11 at 13:38
You mean my pseudocode, which naturally would be more concise than a complete solution like yours :) — Tom Zych, Sep 14 '11 at 13:45

score 0 · Answer 2 · answered Sep 14 '11 at 13:38

Ok, I assume you've opened the file and can read and parse lines with csv. I suggest making a dictionary, with the keys being filenames, and the values being a tuple: (count, sum of datediffs). Something like this:

data = {}
while [read and parse line]:
    datediff = [DATE3 - DATE2]
    if filename not in data:
        data[filename] = (1, datediff)
    else:
        ct, sum = data[filename]
        data[filename] = (ct + 1, sum + datediff)

for fname in sorted(data.keys):
    ct, sum = data[filename]
    avg = sum / ct
    [print]

Thanks for the solution I will work on this and report back with how it goes! — mrmylanman, Sep 14 '11 at 13:50

Math operation on very large CSV (roughly 25-30K lines) in Python3

2 Answers2