-1

I have a DataFrame from pandas called df which has multiple columns (only shown 3 below) and 90,000 rows:

        Key        Date     Rating
0      123abc   08/19/2015    A
1      456def   04/23/2013    B-
2      123abc   06/10/2012    C
3      789ghi   01/04/2017    B
.        .           .        .
.        .           .        .
90000  999zzz   12/12/2012    D

I want to create a separate DataFrame, df_ratings, that has two columns: Key and Rating List. In df_ratings, the Key column needs to be unique, and the Rating List column should contain a list of all Ratings that appear against that Key in df.

        Key       Rating List
0      123abc     ['A', 'C']
1      456def       ['B-']
2      789ghi     ['B', 'D']
.        .            .
.        .            .
30000  999zzz   ['A', 'C+', 'D']

The approach I have used so far is:

df_zip = list(zip(df['Key'], df['Rating']))

def dfRatingsList(row):
    rating = []
    for x, y in df_zip:
        if row['Key'] == x:
            rating.append(y)
    return rating

df_ratings = pd.DataFrame(df['Key'].unique(), columns=['Key'])
df_ratings = df_ratings.fillna('NULL')
df_ratings['Rating List'] = df_ratings.apply(dfRatingsList, axis=1)

Given the size of my dataset, this takes multiple hours to run. How can I speed this process up/improve my code?

  • There's a good answer to this here: https://stackoverflow.com/questions/22219004/how-to-group-dataframe-rows-into-list-in-pandas-groupby – SimonR May 27 '20 at 14:50

1 Answers1

0

Try this:

df = df.groupby(by=['Key'], as_index=False).agg({'Rating': list})
print(df)

      Key        Rating
0  123abc  [A, A, A, A]
1  123def           [C]
2  456def          [B-]
3  789ghi           [B]
4  999zzz           [D]
NYC Coder
  • 7,424
  • 2
  • 11
  • 24