Could you tell me how to count the number of citations per patent for the following data?

"CITING","CITED"
3858241,956203
3858241,1324234
3858241,3398406
3858241,3557384
3858241,3634889
3858242,1515701
3858242,3319261
3858242,3668705
3858242,3707004
3858243,2949611
3858243,3146465
3858243,3156927

The "CITED" column holds the Patent number.

The desired output is a DataFrame in the following format:

 +---------+------------+
 | NPatent | ncitations |
 +---------+------------+
 | 3060453 |          3 |
 | 3390168 |          6 |
 | 3626542 |         18 |
 | 3611507 |          5 |
 | 3000113 |          4 |
 +---------+------------+

I'm currently using the following code, which is not generating the desired output:

# Import pandas
import pandas as pd

# Read the bz2-compressed file into a DataFrame
df = pd.read_csv('/datos/cite75_99.txt.bz2', compression='bz2', header=0, sep=',', quotechar='"')

df = df.groupby('CITED').CITING.nunique()

print(df)

I would appreciate your help in getting the desired DataFrame.

Thank you!

1 Answer

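import pandas as pd
# df is the DataFrame read from the file in the question.
# Count the citing rows per cited patent, then turn the CITED index back into a column.
df = df.groupby('CITED')['CITING'].count().reset_index()
# Rename the columns to match the desired output.
df.columns = ['NPatent','ncitations']
df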
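
To try this without the full file, here is a minimal, self-contained sketch that runs the same groupby on the sample rows from the question. Every cited patent appears only once in that excerpt, so two extra rows (hypothetical, not from the real data) are appended to make one patent accumulate more than one citation. The variable names `sample`, `extra`, `counts`, and `result` are illustrative only.

```python
import io

import pandas as pd

# Sample rows copied from the question.
sample = '''"CITING","CITED"
3858241,956203
3858241,1324234
3858241,3398406
3858241,3557384
3858241,3634889
3858242,1515701
3858242,3319261
3858242,3668705
3858242,3707004
3858243,2949611
3858243,3146465
3858243,3156927
'''

# HYPOTHETICAL rows (not in the original data) so that one patent is cited more than once.
extra = "3858244,956203\n3858245,956203\n"

df = pd.read_csv(io.StringIO(sample + extra))

# Grouping by CITED and counting CITING yields a Series indexed by CITED.
counts = df.groupby('CITED')['CITING'].count()

# reset_index() moves CITED out of the index, producing a two-column DataFrame
# whose columns can then be renamed to match the desired output.
result = counts.reset_index()
result.columns = ['NPatent', 'ncitations']

print(result)
```

Note that `count()` counts rows per cited patent, while the `nunique()` call in the question counts distinct citing patents; the two agree as long as no (CITING, CITED) pair is repeated in the file.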
  • you might want to reset the index first: `df.groupby('CITED')['CITING'].count().reset_index()` – anky Jun 21 '20 at 09:51
  • This worked. Resetting the index was necessary for it to work. Thank you. –  Jun 21 '20 at 10:42