1

I have two dataframes df_country_cluster and df_countries, with the following structure (in order):

            cluster_id
country_id
1                    4
2                    4
...                  ...

                     col1   col2   col3   col4
country_id  year_id
1           2015     0.1    0.2    0.3    0.1
1           2016     0.4    NaN    0.1    0.8
1           2017     0.7    0.2    0.6    NaN
1           2018     0.9    0.4    0.7    0.2  
2           2015     0.5    0.6    NaN    0.3 
2           2016     0.3    0.7    0.2    0.5 
2           2017     0.2    0.9    0.3    0.5 
2           2018     0.1    0.2    0.1    0.9 
...         ...      ...    ...    ...    ...

My goal is to fill the NaN values with the average non-NaN values of each cluster and year. This means that, for example, the NaN in country_id 1, year_id 2016, col2 should be filled with the average of the valid values of col2 for 2016 and all the countries with cluster_id 4 (in this case).

In the above example, we would get the averages of cluster_id 4 in this way:

                     col1   col2   col3   col4
cluster_id  year_id
4           2015     0.3    0.4    *0.3   0.2
4           2016     0.4    *0.7   0.2    0.6
4           2017     0.4    0.6    0.4    *0.5
4           2018     0.5    0.3    0.4    0.6

Therefore, the NaN each column would be filled by the values with a *.

I have tried to create a new DataFrame with groupby().mean(), and then use .fillna, but without success. Other SO questions like this only discuss the single-index problem.

Here is my approach:

    cols = ['col1','col2','col3','col4']
    original_index = df_countries.index

    df_countries = df_countries.join(df_country_cluster,on='country_id')
    df_countries = df_countries.reset_index().set_index(['cluster_id','year_id'])
    avg_cluster = df_countries.groupby(['cluster_id','year_id']).mean()
    avg_cluster = avg_cluster[cols]

    for col in cols:
        df_countries[col].fillna(avg_cluster[col],inplace=True)

    df_countries.reset_index().set_index(original_index)
nunoapedro
  • 13
  • 4
  • What is expected output? What are numbers instead NaNs ? – jezrael Mar 13 '19 at 12:12
  • The expected output is to have the df_countries dataframe with the NaN filled by the average non-NaN values of each cluster and year. – nunoapedro Mar 13 '19 at 12:22
  • 1
    Hi, it is not clear to me how you want to take the average. In your example, since there is only one `country_id=1` for `year_id=2016` it is not possible to take an average. I get that this is only an example. But I think perhaps you should take an example where it is possible to find the average. Or am I not understanding something? – mortysporty Mar 13 '19 at 12:22
  • @nunoapedro - For me unclear too, so ask what are averages numbers from sample data? What is number for repalce first NaN, for second and similar? It seems your data are not [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) – jezrael Mar 13 '19 at 12:24
  • Hey @mortysporty and jezrael, it wasn't clear in the questions, but I've updated the tables. The data is complete in that each country_id has a cluster_id between 1 and 4, and there are 50 countries, each with data from 2015 to 2018. I have only included an example. – nunoapedro Mar 13 '19 at 12:32
  • When I run `df_countries.groupby(['cluster_id','year_id']).mean()` I get the average of each cluster / year – nunoapedro Mar 13 '19 at 12:33
  • Hi again. I dont see how you adding that last row changes anything with regards to my earlier comment. Based on the provided data it is still not possible to create an average for `cluster_id=4` and `year_id=2016` for `co2l`. Anyway, I created some dummy data based on my understanding of what you want… and running your code it seems to me you get the desired output. Im struggeling to understand your problem... – mortysporty Mar 13 '19 at 14:15
  • I think perhaps you should provide an example of your desired output.... – mortysporty Mar 13 '19 at 14:21
  • @mortysporty, the output is the `df_countries` dataframe with the NaN values filled. I've added more info that should help explain what I need. Thank you – nunoapedro Mar 13 '19 at 14:43

2 Answers2

0

I am not sure if I understand you right. But your approach looks good. So we start with

df = df_countries.join(df_country_cluster, on='country_id')
df = df.reset_index().set_index(['cluster_id','year_id'])

You say that the fill value is the mean per group:

s = df[['col1', 'col2', 'col3', 'col4']].mean(axis=1)

We can put things togehter if we transpose the DataFrame

df = df.T.fillna(value=s).T

Finally we drop what we don't need

df = df.reset_index().drop(columns='cluster_id').set_index('country_id', 'year_id')
JoergVanAken
  • 1,286
  • 9
  • 10
  • The fill value would have to be the mean per cluster and year, something along the lines of what `avg_cluster` prints. However, I will try transposing the DataFrame. – nunoapedro Mar 13 '19 at 12:55
  • `s` is the mean per cluster and year, have a look at the `axis=1` argument. – JoergVanAken Mar 13 '19 at 12:58
  • thank you, that's right, it does calculate the mean by cluster and year. However, when I print it, it averages col1, col2, col3 and col4 - I would need the average of each column (but by cluster, year), and apply that to the country, year NaN values – nunoapedro Mar 13 '19 at 13:13
0

Got it.

df_countries = df_countries.reset_index().set_index(original_index)

Forgot to keep the answer with the correct index... With this change, it works. However, if anyone has a more pythonic way to do it, please add your answer!

nunoapedro
  • 13
  • 4