I am trying to replace the user_score with the average user_score for the game's platform and genre. This is my code:

dft = new_df.query('user_score != "tbd" & user_score.isnull()')
df_typical_user_ratio_by_platform = dft.groupby(['platform', 'genre'])['user_score'].apply(lambda x: x.sample(1).iloc[0])

def correct_user_score(row):
    platform = row['platform']
    genre = row['genre']
    if (row['user_score'] == 'tbd' or pd.isnull(row['user_score']) or row['user_score']=='nan'):
        u = df_typical_user_ratio_by_platform.loc[[platform, genre]].head(1).astype('float')
        uScore = ", ".join(map(str, u)) 
    else:
        uScore = row['user_score']
        
    return uScore

row = pd.Series(data=row_values, index=['user_score', 'platform', 'genre'])
correct_user_score(row)
new_df['user_score'] = new_df.apply(correct_user_score, axis=1)
new_df.sample(40)
# df['user_score'] = df['user_score'].astype('int')

This is the result. user_score currently has dtype object. I'm not sure how to replace the nan values. I tried doing if u = 'nan', but that didn't work. Any advice?

https://i.stack.imgur.com/g7AU4.jpg

Libby
  • Here are some ways to replace nan: https://www.geeksforgeeks.org/replace-nan-values-with-zeros-in-pandas-dataframe/ – LevB Feb 20 '21 at 05:11
  • right, but it's an object and 'nan' – Libby Feb 20 '21 at 06:27
  • Your image shows "NaN", which is of course not equal to "nan". Are you actually getting the string "NaN", or are you getting the floating point value NaN? Those are also two different things. – Tim Roberts Feb 20 '21 at 06:50
  • Try [this](https://stackoverflow.com/a/60203797/11380795) solution – RJ Adriaansen Feb 20 '21 at 06:51
  • Please add sample data and sample output; the whole approach looks more complex than needed – Rob Raymond Feb 20 '21 at 07:13
  • Hi so I'm trying to fix the 'user_score' column only right now and it does have object 'nan' in it which is different from 'NaN'. @TimRoberts – Libby Feb 20 '21 at 08:16
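
The distinction raised in the comments can be checked directly. The following is a minimal sketch on a small made-up column (the values are illustrative, not the asker's data): the string "nan" is an ordinary string that `isnull()` does not flag, while the float NaN is flagged. `pd.to_numeric` with `errors="coerce"` turns both, along with "tbd", into a real NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: a real score, the placeholder "tbd",
# the *string* "nan", and the *float* NaN.
scores = pd.Series(["8.5", "tbd", "nan", np.nan], dtype=object)

# isnull() only flags the float NaN, not the string "nan".
is_missing = scores.isnull().tolist()

# Comparing against the string "nan" only matches the string.
is_nan_string = (scores == "nan").tolist()

# Coercing to numeric maps "tbd", "nan" and NaN to NaN alike.
numeric = pd.to_numeric(scores, errors="coerce")
n_missing_after = int(numeric.isnull().sum())

print(is_missing)
print(is_nan_string)
print(n_missing_after)
```

Once the column is numeric, `isnull()` and `fillna()` treat all three kinds of bad value the same way.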

1 Answer

  • force invalid values to NaN with to_numeric()
  • fillna() with the calculation you want
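import numpy as np
import pandas as pd

s = 20
df = pd.DataFrame({"userid": np.random.randint(1, 5, s),
                   "platform": np.random.choice(["windows", "macos", "ios", "android"], s),
                   "userscore": np.random.randint(1, 10, s)})

# let's splat some scores: among the first 10 rows, 7 becomes "tbd" and 6 becomes "nan"
# (everything is kept as strings so that all np.select choices share one dtype)
df = df.assign(userscore=np.select([(df.userscore == 7) & (df.index < 10), (df.userscore == 6) & (df.index < 10)],
                                   ["tbd", "nan"], df.userscore.astype(str)))

df["bad"] = df.userscore  # keep the raw values for comparison
# force invalid values to NaN
df = df.assign(userscore=pd.to_numeric(df.userscore, errors="coerce"))
# fill each NaN with the mean of its (userid, platform) group
df.userscore = df.userscore.fillna(df.groupby(["userid", "platform"])["userscore"].transform("mean"))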

output

    userid  platform  userscore  bad
 0       3       ios          8    8
 1       3       ios          5    5
 2       1     macos        4.5  tbd
 3       2     macos          3    3
 4       2   android          3    3
 5       2       ios          4    4
 6       1     macos          5    5
 7       4   android          8  nan
 8       1     macos          4    4
 9       2   windows          2    2
10       2   android          1    1
11       4   windows          5    5
12       3   android          2    2
13       2   windows          9    9
14       3   android          8    8
15       2   windows          1    1
16       4   windows          8    8
17       2   windows          4    4
18       2       ios          3    3
19       4   android          8    8
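
Applied to the columns named in the question, the same two steps might look like the sketch below. Only the column names (platform, genre, user_score) come from the question; the frame contents are invented for illustration, and grouping by platform and genre is assumed from the stated goal.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's new_df; only the column names
# (platform, genre, user_score) come from the question.
new_df = pd.DataFrame({
    "platform": ["PS4", "PS4", "PS4", "PC", "PC", "PC"],
    "genre": ["Action", "Action", "Action", "RPG", "RPG", "RPG"],
    "user_score": ["8.0", "tbd", "6.0", "nan", "7.0", np.nan],
})

# Step 1: coerce "tbd", the string "nan" and anything else non-numeric to NaN.
new_df["user_score"] = pd.to_numeric(new_df["user_score"], errors="coerce")

# Step 2: fill each NaN with the mean score of its (platform, genre) group.
group_mean = new_df.groupby(["platform", "genre"])["user_score"].transform("mean")
new_df["user_score"] = new_df["user_score"].fillna(group_mean)

print(new_df)
```

A group with no valid scores at all still comes out as NaN, because its mean is NaN; a fallback such as the overall mean would have to be added separately.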
Rob Raymond