1

Lets say I have two pandas DataFrames, df1 and df2, one containing the names and age of people, and the other detailing what they are studying. What is an efficient way to join the two, so that I have boolean fields of what each person is studying?

e.g. given the following

# df1
name  | age
------|----
John  | 24
Kelly | 49
Gemma | 18
Bob   | 29

# df2
name  | studies
------|----------
John  | education
John  | science
Kelly | science
Bob   | law
Bob   | commerce

How could I create the following dataframe with boolean values for each field of study?

name  | age | education | science | law   | commerce |
------|-----|-----------|---------|-------|----------|
John  | 24  | True      | True    | False | False    |
Kelly | 49  | False     | True    | False | False    |
Gemma | 18  | False     | False   | False | False    |
Bob   | 29  | False     | False   | True  | True     |
oska boska
  • 221
  • 2
  • 14

1 Answers1

3

Use get_dummies with max, then join and replace missing values only for columns from df22:

s = df2.set_index('name')['studies']
df22 = pd.get_dummies(s, prefix_sep='', prefix='', dtype=bool).max(level=0)
df = df1.join(df22, on='name').fillna(dict.fromkeys(df22.columns, False))
print (df)
    name  age  commerce  education    law  science
0   John   24     False       True  False     True
1  Kelly   49     False      False  False     True
2  Gemma   18     False      False  False    False
3    Bob   29      True      False   True    False
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Amazing, thanks so much. I encountered the error `TypeError: get_dummies() got an unexpected keyword argument 'dtype'` but managed to get around it by using `df22 = pd.get_dummies(s, prefix_sep='', prefix='').max(level=0).astype('bool')` instead. Might just be an issue with my data because its not as clean as the example I gave. – oska boska Sep 24 '18 at 09:19
  • @oskaboska - No, problem is your pandas version, `dtype` is implemented in `pandas 0.23.0` – jezrael Sep 24 '18 at 09:20
  • 1
    Ah ok, that makes more sense. Thanks again – oska boska Sep 24 '18 at 09:25
  • @jezrael - Tried `df['column_name'] = df['column_name'].astype('bool')` to get a **boolean** field in dataframe. With this, the boolean value is defaulted to `True`. How to default the **boolean** as `False`? – Love Putin Not War Jun 11 '20 at 05:52
  • @user12379095 - Do you think `df['column_name'] = ~(df['column_name'].astype('bool'))` ? – jezrael Jun 11 '20 at 05:53
  • Correct!! Thanks so much. If I may, could you please explain the code a bit? – Love Putin Not War Jun 11 '20 at 06:22
  • @user12379095 - sure, it convert True to False, False to True, in another words [invert boolean mask](https://stackoverflow.com/questions/15998188/how-can-i-obtain-the-element-wise-logical-not-of-a-pandas-series) – jezrael Jun 11 '20 at 06:23