
I have a large CSV file with 32 column headers. I'd like to sum each column and get 32 individual sums, one per column header. I have access to both Python and PowerShell. Any help would be appreciated.

The furthest I got was this site: pandas groupby with sum() on large csv file?

  • Import into excel and just do it there? It's not programmatic, but if you only need to do it once it will probably be the fastest way. – Tyler Jun 02 '16 at 21:27
  • The file is over 1GB. It is impossible to load into Excel. – specmer Jun 02 '16 at 21:29
  • 1
    Just use a simple `for` loop, so you don't need to read the whole file into memory. – leekaiinthesky Jun 02 '16 at 21:31
  • What I mean is that I am quite inexperienced with python and was not sure how to perform the method provided in the link. I am unsure how to skip over the "groupby" section. – specmer Jun 02 '16 at 21:34
  • @specmer: So if you are yet inexperienced with python my answer below should help you get going. No import no framework and you can easily digest a gigabyte worth of data as you described ;-) we all started inexperienced, and knowing this as a strength we sometimes forget ... – Dilettant Jun 02 '16 at 21:56

5 Answers

import pandas as pd
pd.read_csv(r'my_path_to_file/my_file.csv', sep=';').sum().values

Pandas is definitely the way to go. These two lines of code will print the sum of each column. If you are on Windows, use '\' in your path. I assume your CSV file uses a semicolon as separator (if it's a comma use sep=',', if it's a tab use sep='\t').

If you want to write the result to a file use:

import pandas as pd
df = pd.read_csv(r'my_path_to_file/my_file.csv', sep=';').sum()
df.to_csv(r'my_path_to_file/my_file_sum.csv')
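Since the file is over 1 GB, it may also help to let pandas read it in chunks so the whole file never has to fit in memory at once. A minimal sketch, assuming a semicolon separator; the path, function name, and chunk size here are my own placeholders to adapt:

```python
import pandas as pd

def sum_columns(path, sep=';', chunksize=100_000):
    """Sum every numeric column of a CSV without loading it all at once."""
    total = None
    for chunk in pd.read_csv(path, sep=sep, chunksize=chunksize):
        part = chunk.sum(numeric_only=True)
        # accumulate the per-chunk sums into one running Series
        total = part if total is None else total.add(part, fill_value=0)
    return total

# e.g. sum_columns('my_file.csv').to_csv('my_file_sum.csv')
```

The result is the same Series of per-column totals as above, just computed chunk by chunk.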
  • So I've pretty much put together a concoction of things and this is what got me my best results: `import dask.dataframe as dd`, then `df = dd.read_csv('so.csv', sep=',')` and `df.sum().compute()`. – specmer Jun 02 '16 at 22:10
  • The difference between Vincent Claes and my code is that mine included headers. – specmer Jun 02 '16 at 22:11
  • @specmer if you want to include headers just run the code to save to a file. in this file you will see the column names with the corresponding values. – Vincent Claes Jun 02 '16 at 22:17
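On the point raised in the comments: `sum()` on a DataFrame returns a pandas Series whose index is the column headers, so the names are not lost. A tiny self-contained illustration with made-up data standing in for the real file:

```python
import io
import pandas as pd

# made-up data in place of the 1 GB file
csv_text = "a;b;c\n1;2;3\n4;5;6\n"
sums = pd.read_csv(io.StringIO(csv_text), sep=';').sum()
print(sums)  # a Series: index = header names, values = column totals
```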

In PowerShell (or on Linux/macOS etc.) you should be able to install the excellent CSVFIX command-line package (which works very fast on large CSV files and also has a Windows installer).

You can use the CSVFIX summary command to generate a sum of each column:

csvfix summary -sum 1:32 filename.csv

This will give you a single-line summary of the sum of each column:

"43", "21", "425", "1092", [...]

If the file has a header row, don't forget to also add the `-ifn` flag to ignore the first row.


You can use read_csv in pandas to read the file, and then just use sum() on the dataframe.

import pandas as pd

filename = r'folder/file.txt'
df = pd.read_csv(filename)
total = df.sum()
  • So something like this: df.sum().compute() – specmer Jun 02 '16 at 21:40
  • What am I doing wrong? `In [1]: import pandas as pd In [2]: filename = r'C:\Temp\Symphony\output.csv' In [3]: df = pd.read_csv(r) --------------------------------------------------------------------------- NameError Traceback (most recent call last) in () ----> 1 df = pd.read_csv(r) NameError: name 'r' is not defined In [4]: df = pd.read_csv('C:\Temp\Symphony\output.csv') In [5]: total = df.sum() In [6]:` – specmer Jun 02 '16 at 21:50
  • `df = pd.read_csv(r)` should be `df = pd.read_csv(filename)`. – Batman Jun 02 '16 at 21:51
  • ah. gotcha. once I type `total = df.sum()` I don't get any output – specmer Jun 02 '16 at 21:55
  • `total` is a Pandas Series. Try `print(total)`, or `total['A']`, where 'A' is a column name. – Batman Jun 02 '16 at 21:58

A simple way using only builtins on this sample data file:

#! /usr/bin/env python
from __future__ import print_function

sep = ';'
with open('32_numeric_columns.csv', 'rt') as f:
    columns = f.readline().strip().split(sep)
    sums = [0] * len(columns)
    for line in f:  # iterate lazily instead of readlines(),
                    # so the whole file is never held in memory
        data = line.strip().split(sep)
        for i, cell in enumerate(data):
            sums[i] += float(cell)


print(columns)
print(sums)

on this data file:

a0;a1;a2;a3;a4;a5;a6;a7;a8;a9;b0;b1;b2;b3;b4;b5;b6;b7;b8;b9;c0;c1;c2;c3;c4;c5;c6;c7;c8;c9;d0;d1
1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1
1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1
1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1
1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1
1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1
1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1
1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1
1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1
1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1
1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1

yields:

['a0', 'a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8', 'a9', 'b0', 'b1', 'b2', 'b3', 'b4', 'b5', 'b6', 'b7', 'b8', 'b9', 'c0', 'c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'd0', 'd1']
[10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0]

Working on a huge file with 1,280,000,000 bytes (~1.2 GB) of data took approx. 5 minutes on my machine to produce:

$> time ./so_csv_adder.py
['a0', 'a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8', 'a9', 'b0', 'b1', 'b2', 'b3', 'b4', 'b5', 'b6', 'b7', 'b8', 'b9', 'c0', 'c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'd0', 'd1']
[20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0]

real    4m47.374s
user    4m43.748s
sys 0m2.545s
import csv

with open('yourBigFile.csv', 'r', newline='') as f:  # on Python 2, use 'rb' instead
    spreadsheet = csv.reader(f)  # you may need some options
                                 # depending on the format of the file
    header = None
    for row in spreadsheet:
        if header is None:
            header = row
            mySums = [0] * len(row)  # initialize to zero
            continue
        # this will only work if every cell has a number;
        #   it is faster, so use it if it is possible
        #   in your application
        # mySums = [mySums[x] + float(row[x]) for x in range(len(mySums))]

        # more generally
        for i, x in enumerate(row):
            try:
                converted = float(x)
            except ValueError:  # you may actually want an error
                                # raised; YMMV depending on your data
                converted = 0
            mySums[i] += converted

As I am not sure how you want the output to be formatted, I will leave that to you.
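For one possible output format, a small sketch (the function name and file path are my own) that writes the header row and the totals with the same csv module:

```python
import csv

def write_sums(header, sums, out_path):
    """Write the column names as one row and the column totals as the next."""
    with open(out_path, 'w', newline='') as out:
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerow(sums)
```

Called as e.g. `write_sums(header, mySums, 'sums.csv')` after the loop above, this yields a two-row CSV that opens cleanly in any spreadsheet.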