0

This is a pretty basic question. I want to organize a csv a certain way. So far I have it almost as I want it, and it is pretty fast as well.

import numpy as np
import pandas as pd
import pandas.util.testing as tm; tm.N = 3


data = pd.DataFrame.from_csv('MYDATA.csv')
byqualityissue = data.groupby(["Name","Type"]).size()
df = pd.DataFrame(bytype)

formatted = df.unstack(level=-1)
formatted[np.isnan(formatted)] = 0
formatted.to_csv('bytype.csv', header=True)

bytype looks like this:

Type          Type1        Type2       Type3
Name  
Company 1      10            0           3
Company 2      4             23          3 
Company 3      3             2           0

However, I want it sorted so that the company with the largest total (sum of the row) will be at the top. In this case it should be Company 2 at the top, followed by Company 1, then Company 3. It should select only the top twenty customers to be read in by the rest of my python code.

Name,         Type1,    Type2,   Type3
Company 2,     4,        23,       3
Company 1,     10,       0,        3
Company 3,     3,        2,        0

I think the best way would be to add up the totals in another column and then sort by descending order. In order to do that I wrote this line:

sorted = np.sum(formatted, axis=1)

which successfully outputs a total. I looked at How to add a new column to a CSV file using Python?

And so wrote:

with open('bytype.csv') as csvinput:
    with open('bytype2.csv','w') as csvoutput:
        writer = csv.writer(csvoutput, lineterminator='\n')
        reader = csv.reader(csvinput)

        all = []
        row = next(reader)
        row.append('Total')
        all.append(row)

        for row in reader:
            row.append(np.sum(formatted, axis=1))
            all.append(row)

        writer.writerows(all)

But not only am I not getting an output bytype2, but that method seemed to mean re-reading the csv and then appending the column, and THEN I'd still have to sort them. There must be an easier way to do this that I'm missing.

Community
  • 1
  • 1
jenryb
  • 2,017
  • 12
  • 35
  • 72
  • Just add the column to your dataframe, df[key] = whatever – Padraic Cunningham Jun 08 '15 at 22:31
  • @PadraicCunningham I know it's probably as simple as that, but how exactly do you add it? df[newkeytitle] = sorted? How does it know where in the dataframe to put it? I've been teaching myself python for the last week but I'm still very new. – jenryb Jun 08 '15 at 22:36
  • yes, just use the new col name as the key and whatever you want to add as the value, can you share a little bit if your file? – Padraic Cunningham Jun 08 '15 at 22:39
  • The column will be added to the end of the dataframe, the last column. – Padraic Cunningham Jun 08 '15 at 22:47
  • @PadraicCunningham I can't share my actual data, but I could muster up an example file if you want one. I tried writing df[Total]=sorted but Total as a key is not defined since it's a new column I'm trying to add. So I'm getting (predictably) this error: NameError: name 'Total' is not defined – jenryb Jun 08 '15 at 22:54
  • it needs to be a string `df["Total"]` – Padraic Cunningham Jun 08 '15 at 22:58

1 Answers1

1

Answered my own question after tinkering for another day. Here's how I added a column of totals, and then sorted by that column.

includingtotals = pd.concat([formatted,pd.DataFrame(formatted.sum(axis=1),columns=['Total'])],axis=1)
sorted = includingtotals.sort_index(by=['Total'], ascending=[False])
sorted.to_csv('byqualityissue.csv', header=True)
jenryb
  • 2,017
  • 12
  • 35
  • 72