Rename output columns of groupby and count in DataFrame

Question

Could you tell me how to count the number of citations per patent for the following data?

"CITING","CITED"
3858241,956203
3858241,1324234
3858241,3398406
3858241,3557384
3858241,3634889
3858242,1515701
3858242,3319261
3858242,3668705
3858242,3707004
3858243,2949611
3858243,3146465
3858243,3156927

The "CITED" column holds the Patent number.

The desired output is a DataFrame in the following format:

 +---------+------------+
 | NPatent | ncitations |
 +---------+------------+
 | 3060453 |          3 |
 | 3390168 |          6 |
 | 3626542 |         18 |
 | 3611507 |          5 |
 | 3000113 |          4 |
 +---------+------------+

I'm currently using the following code, which is not generating the desired output:

# Importing Pandas 
import pandas as pd

# Read the bz2-compressed file into a DataFrame
df = pd.read_csv('/datos/cite75_99.txt.bz2', compression='bz2', header=0, sep=',', quotechar='"')

df = df.groupby('CITED').CITING.nunique()

print(df)

I would appreciate your help in getting the desired DataFrame.

Thank you!

This is pandas code. Do you want a solution in PySpark or in pandas? — anky, Jun 21 '20 at 09:42

Christian Eslabon · Accepted Answer · 2020-06-22T05:56:02.610


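import pandas as pd

# Count the citing rows for each cited patent, then turn the group keys back into a column
df = df.groupby('CITED')['CITING'].count().reset_index()
# Rename the two resulting columns to the desired names
df.columns = ['NPatent', 'ncitations']
df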
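
For a self-contained check, here is a minimal sketch that applies the same approach to the twelve sample rows from the question, read from an in-memory string rather than the bz2 file. The names `sample` and `result` are illustrative and not part of the original post.

```python
import io
import pandas as pd

# Sample rows copied from the question; the real data lives in a bz2 file
sample = '''"CITING","CITED"
3858241,956203
3858241,1324234
3858241,3398406
3858241,3557384
3858241,3634889
3858242,1515701
3858242,3319261
3858242,3668705
3858242,3707004
3858243,2949611
3858243,3146465
3858243,3156927
'''

df = pd.read_csv(io.StringIO(sample), header=0, sep=',', quotechar='"')

# Count citing rows per cited patent and move the group keys into a column
result = df.groupby('CITED')['CITING'].count().reset_index()
result.columns = ['NPatent', 'ncitations']
print(result)
```

Every cited patent appears only once in this sample, so each count is 1; the full file would be expected to produce larger counts.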

edited Jun 22 '20 at 05:56

answered Jun 21 '20 at 09:48



You might want to reset the index first: `df.groupby('CITED')['CITING'].count().reset_index()` – anky Jun 21 '20 at 09:51
This worked. Resetting the index was necessary for it to work. Thank you. – Jun 21 '20 at 10:42
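
To illustrate why the index reset matters: `groupby(...)[...].count()` returns a Series whose index holds the group keys, so there is only one column of values to name until `reset_index()` turns it into a two-column DataFrame. The sketch below uses a small made-up dataset (an assumption, not data from the question) and one possible variant that names both columns while resetting.

```python
import pandas as pd

# Hypothetical toy data, not taken from the question
df = pd.DataFrame({
    'CITING': [3858241, 3858241, 3858242, 3858243],
    'CITED': [956203, 1324234, 956203, 956203],
})

# The grouped count is a Series indexed by CITED
counts = df.groupby('CITED')['CITING'].count()
print(type(counts).__name__)

# reset_index(name=...) names the value column; rename() handles the key column
result = (
    counts.reset_index(name='ncitations')
          .rename(columns={'CITED': 'NPatent'})
)
print(result)
```

This variant avoids assigning to `df.columns` positionally, which may be easier to read when more columns are involved.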
