2

I'm still new to python and not entirely sure on the way to approach this. I have a dataframe about video games with title, platform, global sales, and release dates of importance. There are some entries where release date is missing. I'd like to replace the missing value with the mean release date by platform if an entry also has a non 0 global sales value. I'm not entirely sure how to structure this so that it pulls the appropriate mean values, whether or not I need a nested loop, etc. Please let me know if I'm on the right track or what I can do to incorporate this and if you need any clarification, thanks!

    games.head()
                        Name         Platform  Global_Sales Release_Date
    0  Grand Theft Auto: San Andreas      PS2         20.81   2004-10-26
    1             Grand Theft Auto V      PS3         20.30   2013-09-17
    2             Grand Theft Auto V      PS4         18.46   2014-11-18
    3    Grand Theft Auto: Vice City      PS2         16.15   2002-10-28
    4             Grand Theft Auto V     X360         15.85   2013-09-17

    games.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 28852 entries, 0 to 28851
    Data columns (total 4 columns):
    Name            28852 non-null object
    Platform        28852 non-null category
    Global_Sales    16025 non-null float64
    Release_Date    27757 non-null datetime64[ns]

    for date in games.Release_Date:
      if pd.isnull(date) and games.Global_Sales !=0:
        games.Release_Date = [mean Release_Year for appropriate Platform]

I have another df with the mean values: platform_means, taken from splitting my datetime object and finding the mean year value that I want to use.

    platform_means.head()
      Platform      Release_Year  
    0      3DS       2012.282895       
    1       DC       2000.077778       
    2       DS       2007.654777       
    3       GB       1999.375000       
    4      GBA       2003.180401       

so here would be an example of what I would want, hopefully it helps. I can use Release_Date as datetime or Release_Date, which is an int depending on which is easier. I've just never had datetime before.

from something like this:

    games.head()
                                Name         Platform  Global_Sales Release_Date
            0             A                       PS2         20.81   2004-10-26
            1             B                       GBA         20.30   nan
            2             C                       PS4         00.00   nan
            3             D                       PS2           nan   nan
            4             E                      X360         15.85   2013-09-17

To this:

    games.head()
                                Name         Platform  Global_Sales Release_Date
            0             A                       PS2         20.81   2004-10-26
            1             B                       GBA         20.30   2003.18
            2             C                       PS4         00.00   nan
            3             D                       PS2           nan   nan
            4             E                      X360         15.85   2013-09-17

I've been using something like this and it works, but the conditional part doesn't. With the conditional, I get an error but without it, I just replace all rows that are missing dates instead of just the ones that also have sales values:

    for index, row in games[games['Release_Date'].isnull()].iterrows():
       if games['Global_Sales'] <= 0.01 | games['Global_Sales'].isnull():
           games.loc[games.index == index, 'Release_Date'] = 
    platform_means.loc[platform_means.Platform == row['Platform'], 
    'Release_Year'].item()
  • Please provide a dataframe that we can *reproduce*, e.g. see [**How to make good reproducible pandas examples**](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). This isn't just for *our* convenience, it's also to make sure we infer correct data types. – jpp Nov 14 '18 at 09:29
  • Can you add an example with desired output? Your `Release_Date` is `datetime64[ns]` object while `Release_Year` is `float64` I assume. How do you plan on representing that `float64` value in `datetime64[ns]` format? – zipa Nov 14 '18 at 11:08
  • Hi, (sorry busy week). Jpp - I don't understand what's wrong specifically even after reading the post, can you clarify please? @zipa sure! I'm not too familiar with datetime, I have another column with just the year as an int that I could use instead (and where I got the average values from) or whichever would be easier. I'm just looking for the general format for how to come up with the conditional code. – Jason Wilcox Nov 18 '18 at 00:54

2 Answers2

2

the following might be what you are looking for :

for index, row in games[games['Release_Date'].isnull()].iterrows():
    games.loc[games.index == index, 'Release_Date'] = platform_means.loc[platform_means.Platform == row['Platform'],'Release_Year'].item()
Pelican
  • 193
  • 2
  • 17
  • Thanks this works except it replaces all missing values instead of just those who also have a non-0/null value in global_sales. I'm not sure how to properly incorporate a conditional onto this, is there a simple way? – Jason Wilcox Nov 18 '18 at 00:47
  • I've been trying something like this, but it's not working as I expected and I get an error: the truth value of a series is ambiguous. – Jason Wilcox Nov 18 '18 at 22:14
  • 1
    @JasonWilcox I will be close to my computer tomorrow morning (GMT) . Have you tried filtering on 2 conditions instead of only the index in « games.loc[(condition1)&(condition2),’Realease_Date’] = ... – Pelican Nov 18 '18 at 22:29
  • 1
    Thanks!!! Yes, I tried a few different conditionals, but changed my mind and decided that I'd prefer to replace all missing values and instead remove unneeded data later (as for some analysis the imputed values will be helpful). Thanks again for the help, I'm still learning Python (much prefer R!) – Jason Wilcox Nov 19 '18 at 19:37
0

I would try using the pd.where method. See docs.

games['Release_Date'].where(games['Release_Date'].isnull(), 
                            games.join(platform_means, on='Platform')['Release_Year'])
zsomko
  • 581
  • 2
  • 6