3

I'm trying to solve a problem for a course in Python and found that someone has implemented solutions for the same problem on GitHub. I'm just trying to understand the solution given there.

I have a pandas dataframe called Top15 with 15 countries and one of the columns in the dataframe is 'HighRenew'. This column stores the % of renewable energy used in each country. My task is to convert the column values in 'HighRenew' column into boolean datatype.

If the value for a particular country is higher than the median renewable energy percentage across all 15 countries, I should encode it as 1; otherwise it should be a 0. The 'HighRenew' column is sliced out as a Series from the dataframe, which is copied below.

Country
China                  True
United States         False
Japan                 False
United Kingdom        False
Russian Federation     True
Canada                 True
Germany                True
India                 False
France                 True
South Korea           False
Italy                  True
Spain                  True
Iran                  False
Australia             False
Brazil                 True
Name: HighRenew, dtype: bool

The github solution is implemented in 3 steps, of which I understand the first 2 but not the last one where lambda function is used. Can someone explain how this lambda function works?

median_value = Top15['% Renewable'].median()
Top15['HighRenew'] = Top15['% Renewable']>=median_value
Top15['HighRenew'] = Top15['HighRenew'].apply(lambda x:1 if x else 0)
thileepan

3 Answers

6

lambda represents an anonymous (i.e. unnamed) function. If it is used with pd.Series.apply, each element of the series is fed into the lambda function. The result will be another pd.Series with each element run through the lambda.
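To make that concrete, here is a minimal sketch showing that the lambda, a named function, and an explicit loop all produce the same result. The Series `high_renew` and the helper `to_int` are hypothetical stand-ins, not taken from the original solution.

```python
import pandas as pd

# Hypothetical stand-in for the boolean HighRenew column (illustrative values).
high_renew = pd.Series(
    [True, False, True],
    index=["China", "Japan", "Brazil"],
    name="HighRenew",
)


def to_int(x):
    """Named equivalent of `lambda x: 1 if x else 0`."""
    return 1 if x else 0


# apply calls the function once per element and collects the results.
with_lambda = high_renew.apply(lambda x: 1 if x else 0)
with_def = high_renew.apply(to_int)

# Roughly what apply does under the hood: a plain Python loop.
with_loop = pd.Series(
    [to_int(x) for x in high_renew],
    index=high_renew.index,
    name=high_renew.name,
)

print(with_lambda.tolist())  # [1, 0, 1]
```

All three variables hold the same values; the lambda is just a way to avoid naming the function.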

apply + lambda is just a thinly veiled loop. You should prefer to use vectorised functionality where possible. @jezrael offers such a vectorised solution.

The equivalent in regular python is below, given a list lst. Here each element of lst is passed through the lambda function and aggregated in a list.

list(map(lambda x: 1 if x else 0, lst))

It is a Pythonic idiom to test for "Truthy" values using if x rather than if x == True, see this answer for more information on what is considered True.
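As a small illustration of the difference, the sketch below runs the same lambda over a hand-picked list of values (the list is only an example) and compares it with an explicit `== True` test.

```python
# Example values chosen to show which objects count as "truthy".
values = [True, False, 1, 0, "", " ", None, [], [0]]

# `if x` uses truthiness: empty containers, empty strings, 0 and None are falsy.
truthy = list(map(lambda x: 1 if x else 0, values))

# `if x == True` only matches values that compare equal to True.
strict = [1 if x == True else 0 for x in values]  # noqa: E712

print(truthy)  # [1, 0, 1, 0, 0, 1, 0, 0, 1]
print(strict)  # [1, 0, 1, 0, 0, 0, 0, 0, 0]
```

Note that the whitespace string `" "` and the list `[0]` are truthy because they are non-empty, even though neither equals `True`.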

jpp
3

I think apply is a loop under the hood; it is better to use the vectorized astype, which converts True to 1 and False to 0:

Top15['HighRenew'] = (Top15['% Renewable']>=median_value).astype(int)
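A quick sketch to check that both approaches give the same encoding. The frame `top15` and its numbers are made up for illustration; the real Top15 data is not shown in the question.

```python
import pandas as pd

# Made-up percentages, only to exercise the two approaches.
top15 = pd.DataFrame(
    {"% Renewable": [20.0, 10.0, 5.0, 70.0]},
    index=["China", "United States", "Japan", "Brazil"],
)

median_value = top15["% Renewable"].median()  # 15.0 for this data
mask = top15["% Renewable"] >= median_value   # boolean Series

via_astype = mask.astype(int)                      # vectorized conversion
via_apply = mask.apply(lambda x: 1 if x else 0)    # element-by-element

print(via_astype.tolist())  # [1, 0, 0, 1]
print(via_apply.tolist())   # [1, 0, 0, 1]
```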

lambda x:1 if x else 0

is an anonymous function (lambda function) containing a conditional expression: if x is truthy it returns 1, else it returns 0.

For more information about lambda functions, see these answers.

jezrael
  • So, in this solution, lambda x:1 if x else 0 does the lambda function consider each values in the Series, and gives a – thileepan Mar 02 '18 at 13:20
  • @thileepan - exactly. – jezrael Mar 02 '18 at 13:23
  • So, the lambda function takes each value of the Series and returns a 1 if the value is True, else a 0. What if I have a Series like this: pd.Series(data = ['Germany', 'India', 'France'], index = [1, ' ', None]), and apply the same lambda function 'lambda x: 1 if x else 0'? How will the function work then? – thileepan Mar 02 '18 at 13:29
  • In that case `apply` would throw an error because `' '` doesn't support comparison. I believe the same would be true for pretty much any method. You would need to explicitly fill null values. There's a whole host of information regarding best practices for that process. – Brandon Barney Mar 02 '18 at 13:33
  • More specifically, a str is unorderable. A string can be compared, just not compared numerically. – Brandon Barney Mar 02 '18 at 13:33
  • @BrandonBarney It doesn't throw an error when I tried, it outputs a Series like this, Germany 1 India 1 France 0 dtype: int64 – thileepan Mar 02 '18 at 13:37
  • @thileepan - Sure, it returns `0` for `None` because some other values [link](https://stackoverflow.com/a/20421262/2901002) are processed like `False` – jezrael Mar 02 '18 at 13:39
  • @thileepan - check it `s = pd.Series( ['Germany', 'India', 'France'], index = [1, ' ', None]) print (s.apply(lambda x:1 if x else 0)) print (s.index.map(lambda x:1 if x else 0))` – jezrael Mar 02 '18 at 13:39
  • @thileepan - And non-empty `string`s are processed like `True` - [link](https://stackoverflow.com/a/20420996/2901002) – jezrael Mar 02 '18 at 13:41
  • @thileepan The lambda `1 if x else 0` will work because `' '` does have a truth value, but won't work for comparing the value. So `1 if x else 0` gives 1 for any non-null, but `1 if x > y else 0` will fail. I read your apply in the original context of comparison. – Brandon Barney Mar 02 '18 at 13:49
  • @jezrael yes, I tried it and I got this value `[1 1 0]`, which means the lambda function considers the string ' ' as True. Isn't it supposed to be considered False according to this [link](https://stackoverflow.com/questions/20420934/python-booleans-if-x-vs-if-x-true-vs-if-x-is-true/20421262#20421262)? – thileepan Mar 02 '18 at 13:51
  • @BrandonBarney according to this [link](https://stackoverflow.com/questions/20420934/python-booleans-if-x-vs-if-x-true-vs-if-x-is-true/20420996#20420996) the ` ' ' ` has a False value and not a True value. so, i'm confused. – thileepan Mar 02 '18 at 13:55
  • @thileepan - try the difference: `''` or `""` is an empty string, while `' '` or `'""'` is not empty - the first because it contains whitespace and the second because it contains `""` – jezrael Mar 02 '18 at 13:57
  • Yeah, my mistake. I meant the non-empty string `' '` since the non-empty version has some value, whereas the empty `''` has no value. – Brandon Barney Mar 02 '18 at 13:58
  • @jezrael yes you are correct, the `''` and `""` is taken as a False because it is actually empty as opposed to `' '` and `'""'` which are not empty – thileepan Mar 02 '18 at 14:04
  • @BrandonBarney you are exactly correct, `' '` is non-empty and thus True, and `''` is empty and thus False. – thileepan Mar 02 '18 at 14:07
0

Instead of using workarounds or lambdas, just use pandas' built-in functionality meant for this problem. The approach is called masking: we apply a comparison operator to a Series (a column of a df) to get the boolean values:

import pandas as pd
import numpy as np

foo = [{
    'Country': 'Germany',
    'Percent Renew': 100
}, {
    'Country': 'Australia',
    'Percent Renew': 75
}, {
    'Country': 'China',
    'Percent Renew': 25
}, {
    'Country': 'USA',
    'Percent Renew': 5
}]

df = pd.DataFrame(foo, index=pd.RangeIndex(0, len(foo)))

df

#| Country   | Percent Renew |
#| Germany   | 100           |
#| Australia | 75            |
#| China     | 25            |
#| USA       | 5             |

np.mean(df['Percent Renew'])
# 51.25

df['Better Than Average'] = df['Percent Renew'] > np.mean(df['Percent Renew'])

#| Country   | Percent Renew | Better Than Average |
#| Germany   | 100           | True
#| Australia | 75            | True
#| China     | 25            | False
#| USA       | 5             | False

The specific reason I propose this over the other solutions is that masking can be used for a host of other purposes as well. I won't get into them here, but once you learn that pandas supports this kind of functionality, it becomes a lot easier to perform other data manipulations in pandas.

EDIT: I read "boolean datatype" as needing True/False rather than the encoded 1 and 0. If you need the encoded version, the astype approach that was proposed will convert the booleans to integer values. For masking purposes, though, the True/False values are what is needed for slicing.
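As a rough sketch of what slicing with a mask looks like, reusing the example frame from above (the variable names `mask` and `above_average` and the 'Encoded' column are illustrative additions):

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Germany", "Australia", "China", "USA"],
    "Percent Renew": [100, 75, 25, 5],
})

# Boolean mask: True where the row is above the column mean (51.25).
mask = df["Percent Renew"] > df["Percent Renew"].mean()

# Slicing: indexing the frame with the mask keeps only the True rows.
above_average = df[mask]

# If the 1/0 encoding is needed, convert the same mask with astype.
df["Encoded"] = mask.astype(int)

print(above_average["Country"].tolist())  # ['Germany', 'Australia']
print(df["Encoded"].tolist())             # [1, 1, 0, 0]
```

Keeping the mask as booleans until the last step means the same Series can serve both for filtering and for the integer encoding.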

Brandon Barney