I'm still new to python and not entirely sure on the way to approach this. I have a dataframe about video games with title, platform, global sales, and release dates of importance. There are some entries where release date is missing. I'd like to replace the missing value with the mean release date by platform if an entry also has a non 0 global sales value. I'm not entirely sure how to structure this so that it pulls the appropriate mean values, whether or not I need a nested loop, etc. Please let me know if I'm on the right track or what I can do to incorporate this and if you need any clarification, thanks!
games.head()
Name Platform Global_Sales Release_Date
0 Grand Theft Auto: San Andreas PS2 20.81 2004-10-26
1 Grand Theft Auto V PS3 20.30 2013-09-17
2 Grand Theft Auto V PS4 18.46 2014-11-18
3 Grand Theft Auto: Vice City PS2 16.15 2002-10-28
4 Grand Theft Auto V X360 15.85 2013-09-17
games.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 28852 entries, 0 to 28851
Data columns (total 4 columns):
Name 28852 non-null object
Platform 28852 non-null category
Global_Sales 16025 non-null float64
Release_Date 27757 non-null datetime64[ns]
for date in games.Release_Date:
if pd.isnull(date) and games.Global_Sales !=0:
games.Release_Date = [mean Release_Year for appropriate Platform]
I have another df with the mean values: platform_means, taken from splitting my datetime object and finding the mean year value that I want to use.
platform_means.head()
Platform Release_Year
0 3DS 2012.282895
1 DC 2000.077778
2 DS 2007.654777
3 GB 1999.375000
4 GBA 2003.180401
so here would be an example of what I would want, hopefully it helps. I can use Release_Date as datetime or Release_Date, which is an int depending on which is easier. I've just never had datetime before.
from something like this:
games.head()
Name Platform Global_Sales Release_Date
0 A PS2 20.81 2004-10-26
1 B GBA 20.30 nan
2 C PS4 00.00 nan
3 D PS2 nan nan
4 E X360 15.85 2013-09-17
To this:
games.head()
Name Platform Global_Sales Release_Date
0 A PS2 20.81 2004-10-26
1 B GBA 20.30 2003.18
2 C PS4 00.00 nan
3 D PS2 nan nan
4 E X360 15.85 2013-09-17
I've been using something like this and it works, but the conditional part doesn't. With the conditional, I get an error but without it, I just replace all rows that are missing dates instead of just the ones that also have sales values:
for index, row in games[games['Release_Date'].isnull()].iterrows():
if games['Global_Sales'] <= 0.01 | games['Global_Sales'].isnull():
games.loc[games.index == index, 'Release_Date'] =
platform_means.loc[platform_means.Platform == row['Platform'],
'Release_Year'].item()