
I have a large CSV file with 32 column headers. I'd like to sum each column and get 32 individual sums, one per column header. I have access to both Python and PowerShell. Any help would be appreciated.

The furthest I got was this site: pandas groupby with sum() on large csv file?

  • Import into excel and just do it there? It's not programmatic, but if you only need to do it once it will probably be the fastest way. – Tyler Jun 02 '16 at 21:27
  • The file is over 1GB. It is impossible to load into Excel. – specmer Jun 02 '16 at 21:29
  • 1
    Just use a simple `for` loop, so you don't need to read the whole file into memory. – leekaiinthesky Jun 02 '16 at 21:31
  • What I mean is that I am quite inexperienced with python and was not sure how to perform the method provided in the link. I am unsure how to skip over the "groupby" section. – specmer Jun 02 '16 at 21:34
  • @specmer: So if you are yet inexperienced with python my answer below should help you get going. No import no framework and you can easily digest a gigabyte worth of data as you described ;-) we all started inexperienced, and knowing this as a strength we sometimes forget ... – Dilettant Jun 02 '16 at 21:56

5 Answers

import pandas as pd
pd.read_csv(r'my_path_to_file/my_file.csv', sep=';').sum().values

Pandas is definitely the way to go. These two lines of code will print the sum of each column. If you are on Windows, use '\' in your path. I assume your CSV file uses a semicolon as separator (if it's a comma use sep=',', if it's a tab use sep='\t').

If you want to write the result to a file use:

import pandas as pd
df = pd.read_csv(r'my_path_to_file/my_file.csv', sep=';').sum()
df.to_csv(r'my_path_to_file/my_file_sum.csv')
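Since the file is over 1 GB, it may also help to let pandas read it in chunks so the whole file never has to fit in memory at once. A minimal sketch, assuming a semicolon separator; the path, function name, and chunk size here are my own placeholders to adapt:

```python
import pandas as pd

def sum_columns(path, sep=';', chunksize=100_000):
    """Sum every numeric column of a CSV without loading it all at once."""
    total = None
    for chunk in pd.read_csv(path, sep=sep, chunksize=chunksize):
        part = chunk.sum(numeric_only=True)
        # accumulate the per-chunk sums into one running Series
        total = part if total is None else total.add(part, fill_value=0)
    return total

# e.g. sum_columns('my_file.csv').to_csv('my_file_sum.csv')
```

The result is the same Series of per-column totals as above, just computed chunk by chunk.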
  • So I've pretty much put together a concoction of things and this is what got me my best results: `import dask.dataframe as dd`, then `df = dd.read_csv('so.csv', sep=',')` and `df.sum().compute()`. – specmer Jun 02 '16 at 22:10
  • The difference between Vincent Claes and my code is that mine included headers. – specmer Jun 02 '16 at 22:11
  • @specmer if you want to include headers just run the code to save to a file. in this file you will see the column names with the corresponding values. – Vincent Claes Jun 02 '16 at 22:17
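On the point raised in the comments: `sum()` on a DataFrame returns a pandas Series whose index is the column headers, so the names are not lost. A tiny self-contained illustration with made-up data standing in for the real file:

```python
import io
import pandas as pd

# made-up data in place of the 1 GB file
csv_text = "a;b;c\n1;2;3\n4;5;6\n"
sums = pd.read_csv(io.StringIO(csv_text), sep=';').sum()
print(sums)  # a Series: index = header names, values = column totals
```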

In PowerShell (or on Linux/macOS etc.) you should be able to install the excellent CSVFIX command-line package (which works very fast on large CSV files and also has a Windows installer).

You can use the CSVFIX summary command to generate a sum of each column:

csvfix summary -sum 1:32 filename.csv

This will give you a single-line summary of the sum of each column:

"43", "21", "425", "1092", [...]

If the file has a header row, don't forget to also add the `-ifn` flag to ignore the first row.


You can use read_csv in pandas to read the file, and then just use sum() on the dataframe.

import pandas as pd

filename = r'folder/file.txt'
df = pd.read_csv(filename)
total = df.sum()
  • So something like this: df.sum().compute() – specmer Jun 02 '16 at 21:40
  • What am I doing wrong? `In [1]: import pandas as pd In [2]: filename = r'C:\Temp\Symphony\output.csv' In [3]: df = pd.read_csv(r) --------------------------------------------------------------------------- NameError Traceback (most recent call last) in () ----> 1 df = pd.read_csv(r) NameError: name 'r' is not defined In [4]: df = pd.read_csv('C:\Temp\Symphony\output.csv') In [5]: total = df.sum() In [6]:` – specmer Jun 02 '16 at 21:50
  • `df = pd.read_csv(r)` should be `df = pd.read_csv(filename)`. – Batman Jun 02 '16 at 21:51
  • ah. gotcha. once I type `total = df.sum()` I don't get any output – specmer Jun 02 '16 at 21:55
  • `total` is a Pandas Series. Try `print(total)`, or `total['A']`, where 'A' is a column name. – Batman Jun 02 '16 at 21:58

A simple way using only builtins on this sample data file:

#! /usr/bin/env python
from __future__ import print_function

sep = ';'
with open('32_numeric_columns.csv', 'rt') as f:
    columns = f.readline().strip().split(sep)
    sums = [0] * len(columns)
    for line in f:  # iterate lazily instead of readlines(),
                    # so the whole file is never held in memory
        data = line.strip().split(sep)
        for i, cell in enumerate(data):
            sums[i] += float(cell)


print(columns)
print(sums)

on this data file:

a0;a1;a2;a3;a4;a5;a6;a7;a8;a9;b0;b1;b2;b3;b4;b5;b6;b7;b8;b9;c0;c1;c2;c3;c4;c5;c6;c7;c8;c9;d0;d1
1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1
1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1
1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1
1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1
1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1
1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1
1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1
1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1
1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1
1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1

yields:

['a0', 'a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8', 'a9', 'b0', 'b1', 'b2', 'b3', 'b4', 'b5', 'b6', 'b7', 'b8', 'b9', 'c0', 'c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'd0', 'd1']
[10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0]

Working on a huge file with 1,280,000,000 bytes (~1.2 GB) of data took approx. 5 minutes on my machine to produce:

$> time ./so_csv_adder.py
['a0', 'a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8', 'a9', 'b0', 'b1', 'b2', 'b3', 'b4', 'b5', 'b6', 'b7', 'b8', 'b9', 'c0', 'c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'd0', 'd1']
[20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0, 20000000.0]

real    4m47.374s
user    4m43.748s
sys 0m2.545s
import csv

with open('yourBigFile.csv', 'r', newline='') as f:  # on Python 2, use 'rb' instead
    spreadsheet = csv.reader(f)  # you may need some options
                                 # depending on the format of the file
    header = None
    for row in spreadsheet:
        if header is None:
            header = row
            mySums = [0] * len(row)  # initialize to zero
            continue
        # this will only work if every cell has a number;
        #   it is faster, so use it if it is possible
        #   in your application
        # mySums = [mySums[x] + float(row[x]) for x in range(len(mySums))]

        # more generally
        for i, x in enumerate(row):
            try:
                converted = float(x)
            except ValueError:  # you may actually want an error
                                # raised; YMMV depending on your data
                converted = 0
            mySums[i] += converted

As I am not sure how you want the output to be formatted, I will leave that to you.
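For one possible output format, a small sketch (the function name and file path are my own) that writes the header row and the totals with the same csv module:

```python
import csv

def write_sums(header, sums, out_path):
    """Write the column names as one row and the column totals as the next."""
    with open(out_path, 'w', newline='') as out:
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerow(sums)
```

Called as e.g. `write_sums(header, mySums, 'sums.csv')` after the loop above, this yields a two-row CSV that opens cleanly in any spreadsheet.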