
I have a csv that looks like this:

CompanyName    High Priority     QualityIssue
Customer1         Yes             User
Customer1         Yes             User
Customer2         No              User
Customer3         No              Equipment
Customer1         No              Neither
Customer3         No              User
Customer3         Yes             User
Customer3         Yes             Equipment
Customer4         No              User

I want to count how many times each value in CompanyName appears in the entire file, sort by that count in descending order, and print each CompanyName only once:

For example, by using this code:

df['count'] = df.groupby('CompanyName')['CompanyName'].transform(pd.Series.value_counts)
df.sort_values('count', ascending=False)

I get:

Out:

CompanyName HighPriority QualityIssue count
5   Customer3           No         User     4
3   Customer3           No    Equipment     4
7   Customer3          Yes    Equipment     4
6   Customer3          Yes         User     4
0   Customer1          Yes         User     3
4   Customer1           No      Neither     3
1   Customer1          Yes         User     3
8   Customer4           No         User     1
2   Customer2           No         User     1

What I want is:

   CompanyName   count
   Customer3       4
   Customer1       3
   Customer4       1
   Customer2       1

Any idea?

Problem 2: With empty rows:

CompanyName    High Priority     QualityIssue
Customer1         Yes             User
Customer1         Yes             User
                  No              User  
Customer3         No              Equipment  
Customer1         No              Neither
                  No              User
Customer3         Yes             User
Customer3         Yes             Equipment
Customer4         No              User

Expected output:

   CompanyName   count
   Customer3       3
   Customer1       3
                   2
   Customer4       1
AMayer

2 Answers


I think you can skip both of those lines and simply write:

# single column
df.CompanyName.value_counts()
# or
df['CompanyName'].value_counts()
# or via Sriram's solution
df.groupby(['CompanyName']).size()

# multiple columns
df.groupby(['CompanyName', 'High Priority']).size()

Python: get a frequency count based on two columns (variables) in pandas dataframe

That should give you what you want instead of appending the count as a column.
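A minimal sketch of how `value_counts()` behaves on the sample data from the question (the DataFrame is reconstructed here from the CSV shown above):

```python
import pandas as pd

# Reconstruction of the sample data from the question
df = pd.DataFrame({
    'CompanyName': ['Customer1', 'Customer1', 'Customer2', 'Customer3',
                    'Customer1', 'Customer3', 'Customer3', 'Customer3',
                    'Customer4'],
})

# value_counts() counts each distinct value, sorts descending by count,
# and lists each name only once
counts = df['CompanyName'].value_counts()
print(counts)
```

This prints Customer3 first with a count of 4, then Customer1 with 3, then Customer2 and Customer4 with 1 each, matching the expected output.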

EDIT

Replace the NaN values, then find the count:

df.CompanyName = df.CompanyName.fillna('unknown')
# or in place
df['CompanyName'].fillna('unknown', inplace=True)

Then use the previous code to summarize.
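A sketch of Problem 2, with the rows that have no CompanyName represented as NaN (the `'unknown'` label is an illustrative assumption, not something from the question):

```python
import pandas as pd
import numpy as np

# Problem 2 data: two rows have no CompanyName (NaN after reading the CSV)
df = pd.DataFrame({
    'CompanyName': ['Customer1', 'Customer1', np.nan, 'Customer3',
                    'Customer1', np.nan, 'Customer3', 'Customer3',
                    'Customer4'],
})

# Fill the missing names so value_counts() includes them in the result
df['CompanyName'] = df['CompanyName'].fillna('unknown')
counts = df['CompanyName'].value_counts()
print(counts)

# Alternatively, skip fillna and count the NaN rows directly:
# df['CompanyName'].value_counts(dropna=False)
```

Here Customer1 and Customer3 each appear 3 times, the filled-in 'unknown' rows count as 2, and Customer4 as 1, matching the expected output for Problem 2.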

Adam
  • Thank you! What if there's a 3rd column that has the same value for the CustomerName, for example: Customer3 Value1, Customer2 Value2, Customer3 Value1 etc. I would like to achieve the same thing as above but also print the column with the ValueX. So it would be Customer3 Value1 2; Customer 2 Value2 1 – AMayer Jul 10 '17 at 21:07
  • 1) Mark answer as correct if that is the case. 2) I'm not sure what you mean. – Adam Jul 10 '17 at 21:13
  • Are you trying to find the frequency count for multiple column values? It would be similar to the other answer posted. Also this. https://stackoverflow.com/questions/33271098/python-get-a-frequency-count-based-on-two-columns-variables-in-pandas-datafra – Adam Jul 10 '17 at 21:18
  • It works with a single flaw: it doesn't count the empty cells. Is there a solution for that as well? – AMayer Jul 10 '17 at 22:29
  • Are the cells empty before the `groupBy` statement? i.e. raw data does not have a value for that particular row and column. – Adam Jul 10 '17 at 22:31
  • I've updated the main post. It is marked with Problem 2 for a better understanding the problem. Thank you! – AMayer Jul 10 '17 at 22:35
  • Post the actual output of the second part – Adam Jul 10 '17 at 22:36
  • With DataFrame.groupby(["CompanyName", "High Priority"]).size() it ignores the empty rows. – AMayer Jul 10 '17 at 22:38

With groupby you can do it like this:

df.groupby(['CompanyName']).size()
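A sketch of this variant on the question's sample data; note that `size()` alone does not sort by count, so `sort_values` is added here to match the expected descending order:

```python
import pandas as pd

# Reconstruction of the sample data from the question
df = pd.DataFrame({
    'CompanyName': ['Customer1', 'Customer1', 'Customer2', 'Customer3',
                    'Customer1', 'Customer3', 'Customer3', 'Customer3',
                    'Customer4'],
})

# size() counts rows per group; sort_values orders the result descending
counts = df.groupby(['CompanyName']).size().sort_values(ascending=False)
print(counts)
```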