0

I am following the University of Michigan MOOC about Data Science in Python Pandas, and I encounter some problem in a test.

I have to use the groupby function to calculate the sum, mean, size and standard deviation of 15 countries, grouped by continent.

The problem is that sum(), std() and size() work with no problem, but not mean(), and I don't know why.

I already tried to specify the type by using dtype=float but I does not work.

This is my code :

# --------- This part is ok, just describing so you can understand --------- #
Top15 = answer_one() # load top 15 countries with most scientific publications

# list of the continents for the top 15 countries
ContinentDict  = {'China':'Asia', 
                  'United States':'North America', 
                  'Japan':'Asia', 
                  'United Kingdom':'Europe', 
                  'Russian Federation':'Europe', 
                  'Canada':'North America', 
                  'Germany':'Europe', 
                  'India':'Asia',
                  'France':'Europe', 
                  'South Korea':'Asia', 
                  'Italy':'Europe', 
                  'Spain':'Europe', 
                  'Iran':'Asia',
                  'Australia':'Australia', 
                  'Brazil':'South America'}

# estimation of the population for each countries 
# by calculating the Energy Supply / Energy Supply per Capita
Top15['PopEst'] = Top15['Energy Supply'] / Top15['Energy Supply per Capita']
Top15 = Top15[['PopEst']]

Top15.reset_index(inplace = True)
Top15['Continent'] = None

# loop that add the coresponding continent to the country
for country in Top15['Country']:
    index_country = ((Top15.loc[Top15['Country'] == country]) # seek country index
                           .index)
    Top15.iloc[index_country,2] = ContinentDict[country] # add continent to country


# ---------- This is the part where I am having problem ---------- #
# create the 'answer' DataFrame
answer = pd.DataFrame(index=['Asia', 'Australia', 
                             'Europe', 'North America', 
                             'South America'], 
                      columns=['size', 'sum', 'mean', 'std'], dtype=float)

grouped = Top15.groupby('Continent')      # group countries by continent

answer['size'] = grouped.size()
answer['sum'] = grouped['PopEst'].sum()
answer['mean'] = grouped['PopEst'].mean()
answer['std'] = grouped['PopEst'].std()

I got at the line answer['mean'] = grouped['PopEst'].mean() , the error:

DataError: No numeric types to aggregate

I don't know where the problem is.

PopEst contains numeric values. For example, the population estimate of China is 1.36765e+09 people.

This is the DataFrame Top15 returned by answer_one() I have to work on:

    Country             PopEst      Continent  
0   Australia           2.3316e+07  Australia
1   Brazil              2.05915e+08 South America
2   Canada              3.52399e+07 North America
3   China               1.36765e+09 Asia
4   France              6.38373e+07 Europe
5   Germany             8.03697e+07 Europe
6   India               1.27673e+09 Asia
7   Iran                7.70756e+07 Asia
8   Italy               5.99083e+07 Europe
9   Japan               1.27409e+08 Asia
10  Russian Federation  1.435e+08   Europe
11  South Korea         4.98054e+07 Asia
12  Spain               4.64434e+07 Europe
13  United Kingdom      6.3871e+07  Europe
14  United States       3.17615e+08 North America

This is what Top15.to_dict() returns me:

{'Country': {0: 'Australia',
  1: 'Brazil',
  2: 'Canada',
  3: 'China',
  4: 'France',
  5: 'Germany',
  6: 'India',
  7: 'Iran',
  8: 'Italy',
  9: 'Japan',
  10: 'Russian Federation',
  11: 'South Korea',
  12: 'Spain',
  13: 'United Kingdom',
  14: 'United States'},
 'PopEst': {0: 23316017.316017315,
  1: 205915254.23728815,
  2: 35239864.86486486,
  3: 1367645161.2903225,
  4: 63837349.39759036,
  5: 80369696.96969697,
  6: 1276730769.2307692,
  7: 77075630.25210084,
  8: 59908256.880733944,
  9: 127409395.97315437,
  10: 143500000.0,
  11: 49805429.864253394,
  12: 46443396.2264151,
  13: 63870967.741935484,
  14: 317615384.61538464},
 'Continent': {0: 'Australia',
  1: 'South America',
  2: 'North America',
  3: 'Asia',
  4: 'Europe',
  5: 'Europe',
  6: 'Asia',
  7: 'Asia',
  8: 'Europe',
  9: 'Asia',
  10: 'Europe',
  11: 'Asia',
  12: 'Europe',
  13: 'Europe',
  14: 'North America'}}
LutinRose
  • 103
  • 2
  • 11
  • What is `answer_one()` ? Does `PopEst` contain Numeric values?:( – harvpan Aug 02 '19 at 17:36
  • answer_one() is (as you guessed) a previous answer. It just return a DataFrame with 15 countries (the couries with the most publications). Yes, PopEst contains numeric values. It is the estimate population of each 15 countries. I found a workaround, but I still don't know why mean() doesn't work. This is my work around : ```answer['mean'] = grouped['PopEst'].sum() / grouped['PopEst'].size()``` – LutinRose Aug 02 '19 at 17:36
  • We do not have the access to `answer_one()`. Can you simply post the dataframe that causes the problem? – harvpan Aug 02 '19 at 17:41
  • I updated my post. The DataFrame that cause the problem is **Top15**. Can the problem come from continents where only **1 country** is present, so mean cannot be applied because there is only one value to calculate the mean on? – LutinRose Aug 02 '19 at 17:53
  • Can you copy and paste or do `top15.to_dict()` and post the output? It is very hard to reproduce a dataframe from the picture. (see [this](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for help.) – harvpan Aug 02 '19 at 17:55
  • check the type of grouped['PopEst'] using type(grouped['PopEst']).If it's an object or string type convert it into numerical type using grouped['PopEst'].astype(float or int). After that apply the mean function. – The Mask Aug 02 '19 at 18:19
  • This is a bug of pandas. Actually, some of your data is not numeric, but sum and std function did not raise an error. If you read the source code, you will find sum does not have its individual function, but mean and std have. Futhermore, mean function check if all the number is numeric. – Yuan Aug 02 '19 at 18:21
  • ```type(grouped['PopEst'])``` returns me: `````` Using ```grouped['PopEst'].astype(float)``` returns me the error : Cannot access callable attribute 'astype' of 'SeriesGroupBy' objects, try using the 'apply' method – LutinRose Aug 02 '19 at 18:27

2 Answers2

1

This is a bug of Pandas, Pandas still do sum and prod calculation in groupby even if data is not numerical. I checked the source code, the bug appears in line 1373 of site-packages\pandas\core\groupby\groupby.py. It writes:

                except Exception:
                    pass

if you print the error, you will probably find 'No numeric types to aggregate' too.

As a walk-around solution, you could change the data to numerical before doing any calculations by using:

df['column'] = pd.to_numeric(df['column'])

some post may tell you to add errors='coerce' inside pd.to_numeric so that non-numerical element will be replace by na and won't raise an error. However, in many situations, it means some errors in the data. We need to fix the data rather than silencing the error.

Yuan
  • 462
  • 4
  • 12
0

This was the best that I could come up with

    final["Sum"]= final.groupby(['Cont_bin']).PopEst.transform('sum')
    final['Sum'] = pd.to_numeric(final['Sum'])
    final["PopEst"]= pd.to_numeric(final["PopEst"])
    final["Size"]=pd.to_numeric(final["Size"])
    final["Mean"]=final.groupby(['Cont_bin', 'Size']).PopEst.transform('mean')
   
    return(final)

you will need to coerce Mean into a numeric to get the standard deviation