I have a pandas DataFrame called df with multiple columns (only 3 shown below) and 90,000 rows:
           Key     Date        Rating
    0      123abc  08/19/2015  A
    1      456def  04/23/2013  B-
    2      123abc  06/10/2012  C
    3      789ghi  01/04/2017  B
    .      .       .           .
    .      .       .           .
    90000  999zzz  12/12/2012  D
I want to create a separate DataFrame, df_ratings, that has two columns: Key and Rating List. In df_ratings, the Key column needs to be unique, and the Rating List column should contain a list of all Ratings that appear against that Key in df:
           Key     Rating List
    0      123abc  ['A', 'C']
    1      456def  ['B-']
    2      789ghi  ['B', 'D']
    .      .       .
    .      .       .
    30000  999zzz  ['A', 'C+', 'D']
The approach I have used so far is:
    import pandas as pd

    df_zip = list(zip(df['Key'], df['Rating']))

    def dfRatingsList(row):
        # For each row, scan every (Key, Rating) pair in the whole dataset
        rating = []
        for x, y in df_zip:
            if row['Key'] == x:
                rating.append(y)
        return rating

    df_ratings = pd.DataFrame(df['Key'].unique(), columns=['Key'])
    df_ratings = df_ratings.fillna('NULL')
    df_ratings['Rating List'] = df_ratings.apply(dfRatingsList, axis=1)
Given the size of my dataset, this takes multiple hours to run. How can I speed this process up/improve my code?
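For comparison, here is a vectorized sketch of the same transformation using groupby followed by apply(list), which makes a single pass over the rows instead of rescanning all 90,000 pairs for every unique key. The column names match the ones above; the sample values are made up for illustration:

```python
import pandas as pd

# Small sample mirroring the structure of df (values are hypothetical)
df = pd.DataFrame({
    'Key':    ['123abc', '456def', '123abc', '789ghi'],
    'Date':   ['08/19/2015', '04/23/2013', '06/10/2012', '01/04/2017'],
    'Rating': ['A', 'B-', 'C', 'B'],
})

# Group once by Key and collect each group's ratings into a list,
# then turn the resulting Series back into a two-column DataFrame.
df_ratings = (
    df.groupby('Key')['Rating']
      .apply(list)
      .reset_index(name='Rating List')
)
print(df_ratings)
```

groupby sorts the keys by default, so the resulting Key column is both unique and ordered; pass sort=False to groupby if the original order of first appearance matters.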