4

In Python, Im trying to perform value_counts on a Pandas column. I can get that to work, but I can't figure out how to get some of the other associated columns.
The code:

import pandas as pd

myframe = pd.DataFrame({"Server":["Server_1","Server_1","Server_1","Server_1","Server_1","Server_2","Server_2","Server_2","Server_2","Server_3","Server_3","Server_3","Server_3","Server_3"], 
"CVE_ID":["CVE-2017-1111","CVE-2017-1112","CVE-2017-1113","CVE-2017-1114","CVE-2017-1115","CVE-2017-1111","CVE-2017-1112","CVE-2017-1113","CVE-2017-1114","CVE-2017-1113","CVE-2017-1114","CVE-2017-1115","CVE-2017-1116","CVE-2017-1117"],
"VulnName":["Java Update 1","Java Update 2","Java Update 3","Adobe 1","Chrome 1","Java Update 1","Java Update 2","Java Update 3","Adobe 1","Java Update 3","Adobe 1","Chrome 1","Chrome 2","Chrome 3"],
"ServerOwner":["Alice","Alice","Alice","Alice","Alice","Bob","Bob","Bob","Bob","Carol","Carol","Carol","Carol","Carol"]})

print "The dataframe: \n", myframe
print "Top 10 offending CVEs, Vulnerability and Count: \n"
print myframe['CVE_ID'].value_counts()

The last line prints out 2 columns: one of CVEs, and one of how many times it occurred. But I want to print out something like this, where it maintains a link between the CVE and the Vulnerability Name (seen in the middle column):

Top 10 offending CVEs, Vulnerability and Count:
CVE-2017-1113   Java Update 1     3
CVE-2017-1114   Java Update 2     3
...etc...

How do I do that? Everything I do keeps throwing errors.

Joshua
  • 40,822
  • 8
  • 72
  • 132

2 Answers2

3

Edit: Changed so that in output have column name access

(Note the addition of as_index=False and .reset_index in [1] See sources 5 and 6

[1] First groupby on CVE_ID column and use size:

counts = myframe.groupby(['CVE_ID','VulnName','ServerOwner'], as_index=False).size().unstack(fill_value=0).reset_index()


ServerOwner         CVE_ID       VulnName  Alice  Bob  Carol
0            CVE-2017-1111  Java Update 1      1    1      0
1            CVE-2017-1112  Java Update 2      1    1      0
2            CVE-2017-1113  Java Update 3      1    1      1
3            CVE-2017-1114        Adobe 1      1    1      1
4            CVE-2017-1115       Chrome 1      1    0      1
5            CVE-2017-1116       Chrome 2      0    0      1
6            CVE-2017-1117       Chrome 3      0    0      1

[2] Then sum over the Alice, Bob and Carol column to get:

counts['Count'] = counts[['Alice','Bob','Carol']].sum(axis=1)

ServerOwner         CVE_ID       VulnName  Alice  Bob  Carol  Count
0            CVE-2017-1111  Java Update 1      1    1      0      2
1            CVE-2017-1112  Java Update 2      1    1      0      2
2            CVE-2017-1113  Java Update 3      1    1      1      3
3            CVE-2017-1114        Adobe 1      1    1      1      3
4            CVE-2017-1115       Chrome 1      1    0      1      2
5            CVE-2017-1116       Chrome 2      0    0      1      1
6            CVE-2017-1117       Chrome 3      0    0      1      1

[3] Then delete name columns using df.drop on names:

counts.drop(['Carol','Bob','Alice'],inplace=True,axis=1)

ServerOwner         CVE_ID       VulnName  Count
0            CVE-2017-1111  Java Update 1      2
1            CVE-2017-1112  Java Update 2      2
2            CVE-2017-1113  Java Update 3      3
3            CVE-2017-1114        Adobe 1      3
4            CVE-2017-1115       Chrome 1      2
5            CVE-2017-1116       Chrome 2      1
6            CVE-2017-1117       Chrome 3      1

[4] Then you use sort_values on the sum column:

counts.sort_values(by='Count', ascending=False, inplace=True)

ServerOwner         CVE_ID       VulnName  Count
2            CVE-2017-1113  Java Update 3      3
3            CVE-2017-1114        Adobe 1      3
0            CVE-2017-1111  Java Update 1      2
1            CVE-2017-1112  Java Update 2      2
4            CVE-2017-1115       Chrome 1      2
5            CVE-2017-1116       Chrome 2      1
6            CVE-2017-1117       Chrome 3      1

Combined:

counts = myframe.groupby(['CVE_ID','VulnName','ServerOwner'], as_index=False).size().unstack(fill_value=0).reset_index()
counts['Count'] = counts[['Alice','Bob','Carol']].sum(axis=1)
counts.drop(['Carol','Bob','Alice'],inplace=True,axis=1)
counts.sort_values(by='Count', ascending=False, inplace=True)

print "The dataframe: \n", myframe
print "Top 10 offending CVEs, Vulnerability and Count: \n"
print counts

Top 10 offending CVEs, Vulnerability and Count: 

ServerOwner         CVE_ID       VulnName  Count
2            CVE-2017-1113  Java Update 3      3
3            CVE-2017-1114        Adobe 1      3
0            CVE-2017-1111  Java Update 1      2
1            CVE-2017-1112  Java Update 2      2
4            CVE-2017-1115       Chrome 1      2
5            CVE-2017-1116       Chrome 2      1
6            CVE-2017-1117       Chrome 3      1

If needed, Can use reset_index() to reset index at this point.

Edit: In response to comment about serverOwner index, you can reset index, drop old index, and rename new index:

counts.reset_index(drop=True, inplace = True)
counts.index.names = ['index']

Gives:

ServerOwner         CVE_ID       VulnName  Count
index                                           
0            CVE-2017-1113  Java Update 3      3
1            CVE-2017-1114        Adobe 1      3
2            CVE-2017-1111  Java Update 1      2
3            CVE-2017-1112  Java Update 2      2
4            CVE-2017-1115       Chrome 1      2
5            CVE-2017-1116       Chrome 2      1
6            CVE-2017-1117       Chrome 3      1

(The ServerOwner name remains as a remnant of the original groupby command to detail which column was used.)

Sources for this answer:

[1] Groupby value counts on the dataframe pandas dataframe-pandas

[2] Pandas: sum DataFrame rows for given columns

[3] Delete column from pandas DataFrame

[4] python, sort descending dataframe with pandas

[5] Converting a Pandas GroupBy object to DataFrame

[6] How to GroupBy a Dataframe in Pandas and keep Columns

Community
  • 1
  • 1
Chuck
  • 3,664
  • 7
  • 42
  • 76
  • Just out of curiosity, what is the first ServerOwner column and how do you get rid of it? I cant just show the CVE_ID, VulnName and Count columns. Is ServerOwner a new name for the index? – user3688402 Apr 12 '17 at 19:27
  • I think it is. If you do `reset_index()` it will add on another index column but `ServerOwner` will still remain. I think it's because of the use of `group by` and making it equal to a new dataframe object: I'm not sure if this is technically the correct way to save it, might be a bit sketchy. Sorry i cant be more help now - will take a look in the morning (bit late here). But yes, i think at some stage it interprets `Server owner` as your index. You could try dropping that column at different points in the process using `df.drop` and see if that gets rid of it. – Chuck Apr 12 '17 at 21:31
  • @user3688402 See my edit for index solution. Many thanks for upvoting and accepting. Glad I could help :) – Chuck Apr 13 '17 at 09:53
1

Use join to append value_counts

myframe.join(myframe['CVE_ID'].value_counts().rename('Count'), on='CVE_ID')

           CVE_ID    Server ServerOwner       VulnName  Count
0   CVE-2017-1111  Server_1       Alice  Java Update 1      2
1   CVE-2017-1112  Server_1       Alice  Java Update 2      2
2   CVE-2017-1113  Server_1       Alice  Java Update 3      3
3   CVE-2017-1114  Server_1       Alice        Adobe 1      3
4   CVE-2017-1115  Server_1       Alice       Chrome 1      2
5   CVE-2017-1111  Server_2         Bob  Java Update 1      2
6   CVE-2017-1112  Server_2         Bob  Java Update 2      2
7   CVE-2017-1113  Server_2         Bob  Java Update 3      3
8   CVE-2017-1114  Server_2         Bob        Adobe 1      3
9   CVE-2017-1113  Server_3       Carol  Java Update 3      3
10  CVE-2017-1114  Server_3       Carol        Adobe 1      3
11  CVE-2017-1115  Server_3       Carol       Chrome 1      2
12  CVE-2017-1116  Server_3       Carol       Chrome 2      1
13  CVE-2017-1117  Server_3       Carol       Chrome 3      1

If you want to limit it to the top n, (my example shows 2), use head and how='inner'

myframe.join(
    myframe['CVE_ID'].value_counts().head(2).rename('Count'),
    on='CVE_ID', how='inner')

           CVE_ID    Server ServerOwner       VulnName  Count
2   CVE-2017-1113  Server_1       Alice  Java Update 3      3
7   CVE-2017-1113  Server_2         Bob  Java Update 3      3
9   CVE-2017-1113  Server_3       Carol  Java Update 3      3
3   CVE-2017-1114  Server_1       Alice        Adobe 1      3
8   CVE-2017-1114  Server_2         Bob        Adobe 1      3
10  CVE-2017-1114  Server_3       Carol        Adobe 1      3
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • Man, there is always a super simple one line way to do things, nice. – Chuck Apr 12 '17 at 08:25
  • Do you not need to reduce the `CVE_ID` column so that you are not showing repeated ID's? i.e. You have three `1113` rows, when only one is required, as it is the count that is required, not the owner name? – Chuck Apr 12 '17 at 11:02