Grouping and Sub-Grouping CSV data in Python

Question

This is my sample dataset in a CSV format:

Column[1], Column[2], Account, CostCentre, Rate, Ex VAT,  VAT
000000000, 00000000,  4200213,    G1023,       0, 10.50,  0.0
000000000, 00000000,  4200213,    G1023,      20, 10.50,  2.1
000000000, 00000000,  4200213,    G1023,       0, 10.50,  0.0
000000000, 00000000,  4200213,    G1023,      20, 10.50,  2.1

I am trying to create an output file that focuses on the account number and further groups it by Cost Centre and Tax Rate, therefore, anything that has the account number 4200213 needs to be included in the output else all other rows can be ignored.

Secondly, if the Cost Centre is being repeated let's say in this instance G1023, I want the python script to establish whether the Tax Rates match and if so I want the output file to group it together by Rate and sum up the total Ex VAT and VAT cost in such a way that the desired result should be as follows:

Cost Centre, Rate, Ex VAT, VAT, In VAT

      G1023,    0,     21,   0,    21     
      G1023,   20,     21, 4.2,    25.20

I've been trying to figure it out but without any success. My current code goes as follows:

import os
import sys
import csv

os.path.dirname = "./"
InputFile_name = "Book1.csv"
InputFile = csv.reader(open(InputFile_name, "r"))
OutputFile_name = "Journal.csv"
OutputFile = open(OutputFile_name, "w")
mydict = []

OutputFile.write("Cost Centre, Tax Rate, Total Ex VAT, VAT, Total In VAT\n")

for line in InputFile:
    if line[2] == "4200213":
        Cost_Centre = line[3]
        Rate = line[4]
        Ex_VAT = line[5]
        VAT = line[6]
        if Cost_Centre in mydict:
            continue
        else:
            mydict.append(Cost_Centre)

        for item in mydict:
            if item in Cost_Centre and Rate == "0":
                Ex_VAT += Ex_VAT
                VAT+= VAT
                In_VAT = Ex_VAT + VAT
            elif item in Cost_Centre and Rate == "20":
                Ex_VAT += Ex_VAT
                VAT+= VAT
                In_VAT = Ex_VAT + VAT
            OutputFile.write(",".join([Cost_Centre,Rate,Ex_VAT,VAT,In_VAT+"\n"]))
OutputFile.close()
print "Finished."
sys.exit()

The script works but I am too far from getting the desired results. And as you'd have figured out that I am not very good at python so would appreciate if rather than just pointing out the mistake if you could make amendments on the script and provide me with the whole script please with some explanation as to what I am doing wrong.

You already noticed, that in your sample dataset and the desired output there is not always a ``,`` between all cells? — Stephan Kulla, Apr 11 '14 at 01:44
I haven't put them here just for the sake of keeping it clean but in the input file and the output file they is a `,` if we check them in text-pad or sublime. I have added them in the question now to clear any confusions. Thanks. — Python Learner, Apr 11 '14 at 07:14
Unfortunately I do not have time now to show a working example... Maybe later... Let me know, if you found a solution in the meanwhile. — Stephan Kulla, Apr 11 '14 at 09:47
Hi @tampis. Thanks for your answer however, I am still struggling with getting the script as my real data set is quite complex and has more than 4 accounts and approximately 70 Cost Centres with each Cost Centre having two tax rates and around 10 entries for each tax rate. So, the purpose of the script should be to just select one account and then group it by the Cost Centre (0 Tax Rate) and add all the costs for that Cost Centre which have 0 Tax Rate and then on a separate row Cost Centre with 20 Tax rate and all their costs. — Python Learner, Apr 11 '14 at 10:08

score 1 · Accepted Answer · answered Apr 11 '14 at 11:15

1

You can use itertools.groupby. I wrote this, unfortunately it is not easy to read.

import csv
import itertools

csvreader = csv.reader(open("Book1.csv", "r"))
lines = [line for line in csvreader]

#Sort
lines =  sorted(lines[1:], key = lambda x: (x[4], x[3], x[2]))

#Grouping
newRows = []
for grp in itertools.groupby(lines, key = lambda x: (x[2], x[3], x[4])):
    newRow = [0, 0] + list(grp[0]) + [0.0, 0.0, 0.0]
    for col in grp[1]:
        newRow[5] += float(col[5])
        newRow[6] += float(col[6])
        newRow[7] += float(col[5]) + float(col[6])
    newRows.append(newRow)

#Filtering and write csv
with open("Journal.csv", "w") as fp:
    csvwriter = csv.writer(fp)
    csvwriter.writerow(["Cost Centre", "Tax Rate", "Total Ex VAT", "VAT", "Total In VAT"])
    for r in filter(lambda x:x[2].strip() == "4200213", newRows):
        csvwriter.writerow(r[3:])

I hope it helps.

answered Apr 11 '14 at 11:15

Kei Minagawa

4,395
3
25
43

I have got it working on the actual data, thanks, but would you mind explaining a little bit about the functions mentioned above please and also the output file has got empty rows after each entry, is it possible to fix that? – Python Learner Apr 11 '14 at 13:12
@Python Learner: If you are new to Python, I think it's hard to realize this code. One thing I can say is before you use `itertools.groupby` function, the lists have to be sorted as you like. Because the function recognize a consecutive same elements as one group. If you want another function behavior, here is the keywords to search. `sorted multiple key` `itertools.groupby multiple key` `csv.writerow` `python filter` – Kei Minagawa Apr 11 '14 at 15:22
@Python Learner: `and also the output file has got empty rows after each entry, is it possible to fix that?` Do you want to add new row at the end of `Journal.csv` which contain a data? If so, you can easily do this by using `open("Journal.csv", "a")` which opens the file with append mode. – Kei Minagawa Apr 11 '14 at 15:26
Many thanks! I will certainly have a look at the functions you mentioned. And yes, you're right I am new to python hence was struggling understanding your code. Thank you so much, much obliged! – Python Learner Apr 11 '14 at 18:00

score 1 · Answer 2 · answered Apr 12 '14 at 01:03

1

Life is too short. This is what libraries like pandas excel at. The entire code:

import pandas as pd
df = pd.read_csv("tax.csv", skipinitialspace=True)
d2 = df.groupby(["CostCentre", "Rate"])[["Ex VAT", "VAT"]].sum()
d2["IN VAT"] = d2["Ex VAT"] + d2["VAT"]
d2.reset_index().to_csv("taxout.csv", index=False)

which produces a new csv file which looks like:

CostCentre,Rate,Ex VAT,VAT,IN VAT
G1023,0,21.0,0.0,21.0
G1023,20,21.0,4.2,25.2

answered Apr 12 '14 at 01:03

DSM

342,061
65
592
494

Hi DSM, Thanks for the code but since I am new to python I had never heard of that function before. Will try to learn more about pandas. Thanks for your answer, much appreciated. – Python Learner Apr 14 '14 at 08:24

score 0 · Answer 3 · edited May 23 '17 at 12:02

I have added some comments in your code:

import os
import sys # not necessary (see comment below)
import csv

os.path.dirname = "./" # not necessary (current directory is always $PWD)

# I would do:
InputFile = csv.reader(open("Book1.csv", "r"))
OutputFile = open("Journal.csv", "w")
mydict = [] # Okay, but you can also use set() (that's the structure you want in the end)
            # name "mydict" is confusion (it's a list)

OutputFile.write("Cost Centre, Tax Rate, Total Ex VAT, VAT, Total In VAT\n")

for line in InputFile:
    if line[2] == "4200213":
        Cost_Centre = line[3]
        Rate = line[4]
        Ex_VAT = line[5] # you mean float(line[5])
        VAT = line[6]    # you mean float(line[6])
        if Cost_Centre in mydict:
            continue
        else:
            mydict.append(Cost_Centre)

        for item in mydict:
            # Why do you have an if-else statement here? Inside each branch you are doing always the same!
            # Why do not you delete this if else statement?
            if item in Cost_Centre and Rate == "0": # I guess you mean: item == Cost_Centre
                Ex_VAT += Ex_VAT
                VAT+= VAT
                In_VAT = Ex_VAT + VAT
            elif item in Cost_Centre and Rate == "20": # I guess you mean: item == Cost_Centre
                Ex_VAT += Ex_VAT
                VAT+= VAT
                In_VAT = Ex_VAT + VAT

            # I would write
            # OutputFile.write(",".join([Cost_Centre,Rate,Ex_VAT,VAT,In_VAT]) +"\n")
            OutputFile.write(",".join([Cost_Centre,Rate,Ex_VAT,VAT,In_VAT+"\n"]))

OutputFile.close()

print "Finished."

sys.exit() # not necessary

Normally lowercase names are used in Python (see http://legacy.python.org/dev/peps/pep-0008/ or What is the naming convention in Python for variable and function names?)

Concerning to your problem you have to first read all lines and afterwards you have to compute and write your final CSV. The error lies here (example):

if line[2] == "4200213":
    ...
    Ex_VAT = float(line[5]) # new variable is read

    ...

         Ex_VAT += Ex_VAT # here will always get EX_VAT * 2

Update: Here is my code:

import csv

from collections import defaultdict
from operator    import add

class vector(tuple):
    def __add__(self, other):
        return vector(other) if len(self) == 0 else vector(map(add, self, other))

mydict = defaultdict(vector)

with open("data.csv", "r") as fd:
    for line in csv.reader(fd):
        line = map(str.strip, line)

        if line[2] == "4200213":
            mydict[line[3], line[4]] += float(line[5]), float(line[6])

with open("journal.csv", "w") as fd:
    writer = csv.writer(fd)
    writer.writerow(["Cost Centre", "Tax Rate", "Total Ex VAT", "VAT", "Total In VAT"])

    for k,v in mydict.iteritems():
        print repr(v)
        writer.writerow(list(k) + list(v) + [sum(v)])

With comments:

import csv

from collections import defaultdict # see https://docs.python.org/2/library/collections.html#collections.defaultdict
from operator    import add # add(x,y) == x + y

# for having vector( (1,2,3) ) + vector( (4,5,6) ) = vector( (5,7,9) )
# see https://stackoverflow.com/questions/2576296/using-python-tuples-as-vectors
# and lookup operator overloading for python on the internet
class vector(tuple): 
    def __add__(self, other):
        return vector(other) if len(self) == 0 else vector(map(add, self, other))

# will be in the end
# mydict = {
#     ("G1023","20"): vector((21.0,4.2)),
#     ("G1023","0"): vector((21.0,0.0))
# }
mydict = defaultdict(vector)

# === read csv file ===

with open("data.csv", "r") as fd: # we have not call fd.close() at the end -> very handy ;-) + exception save!
    for line in csv.reader(fd):
        line = map(str.strip, line) # delete whitespaces from all cells

        if line[2] == "4200213":
            mydict[line[3], line[4]] += float(line[5]), float(line[6])

# === write final csv file ===

with open("journal.csv", "w") as fd:
    writer = csv.writer(fd)
    writer.writerow(["Cost Centre", "Tax Rate", "Total Ex VAT", "VAT", "Total In VAT"])

    for k,v in mydict.iteritems():
        writer.writerow(list(k) + list(v) + [sum(v)]) # output each line in the csv

I recommend you to read the above code slowly line by line until you understand, how everything works (I used pretty cool features of python). Look up things you don't know already in the Internet. If you have some question, feel free to ask me in the comments or do a follow up question here on Stackoverflow.

I have got a working code provided by another user, thanks to him/her but would appreciate if you can also provide another solution. From what I have heard there are various ways of doing one thing in python. Thanks in anticipation. — Python Learner, Apr 11 '14 at 18:02

Grouping and Sub-Grouping CSV data in Python

3 Answers3