
New to Python here. I hope my question isn't entirely redundant - if it is, let me know and chalk it up to my inexperience with StackOverflow.

In any case, I'm working with the Titanic dataset from kaggle.com, and I'd like to use a set of conditional statements to replace the NaN values in the Age column of the dataframe. Ultimately, I'd like to generate results based on the following conditions:

1) if Age is NaN and Title is one of (X, Y, Z), generate a random number in the 0-18 range
2) if Age is NaN and Title is one of (A, B, C), generate a random number in the 19-80 range

Note: 'Title' is a column with the title of the individual listed (e.g. Mr., Mrs., Lord).

I found a similar situation here, but I haven't been able to adapt it to my case as it doesn't approach conditionality at all.

Here is my most recent attempt (updated per the replies):

Attempt 1

import numpy as np  # the code below calls np.random, not the stdlib random module

mask_young = (df.Age.isnull()) & (df.Title.isin(Title_Young)) 
df.loc[mask_young, 'Age'] = df.loc[mask_young, 'Age'].apply(lambda x: np.random.randint(0,18))

mask_old = (df.Age.isnull()) & (df.Title.isin(Title_Old)) 
df.loc[mask_old, 'Age'] = df.loc[mask_old, 'Age'].apply(lambda x: np.random.randint(18,65))

mask_all = (df.Age.isnull()) & (df.Title.isin(Title_All)) 
df.loc[mask_all, 'Age'] = df.loc[mask_all, 'Age'].apply(lambda x: np.random.randint(0,65))

Result: no error, but the NaN values in the 'Age' column are not replaced.

alofgran
  • Your first attempt is looping through a string; I think you meant `for age in df['Age']:`. But more importantly, when using pandas there's no need to loop in this situation. – elPastor Mar 20 '18 at 04:07
  • Thanks for the tip, @pshep123. I've taken that into account (see the edited code above), however, it's not providing the expected result. – alofgran Mar 21 '18 at 02:34

1 Answer


You want to mask your DataFrame and then perform the operation on only the part of the DataFrame that matches your condition.

import numpy as np
import pandas as pd

mask1 = (df.Age.isnull()) & (df.Title == 'Master')
df.loc[mask1, 'Age'] = df.loc[mask1, 'Age'].apply(lambda x: np.random.randint(0,18))

If you really need the functionality of having multiple titles in a list, this can be accomplished by defining the list of titles you care about and then using isin. For example:

list1 = ['Master', 'Sir', 'Mr']
mask1 = (df.Age.isnull()) & (df.Title.isin(list1))
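
To show the masking approach end to end, here is a minimal sketch on a made-up frame. The toy data, the `title_young` / `title_old` lists, and the `fill_missing_ages` helper are all assumptions for illustration, not part of the original dataset or code. It draws one random integer per masked row in a single call instead of using `apply`. Note that `np.random.randint(0, 18)` excludes the upper bound, so the sketch uses `Generator.integers` with `endpoint=True` to make the 0-18 and 19-80 ranges inclusive.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic frame.
df = pd.DataFrame({
    "Title": ["Master", "Mr", "Miss", "Mrs", "Master", "Mr"],
    "Age": [np.nan, np.nan, 4.0, np.nan, 9.0, 30.0],
})

# Hypothetical title groups; substitute your own lists.
title_young = ["Master", "Miss"]
title_old = ["Mr", "Mrs"]

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible


def fill_missing_ages(frame, titles, low, high, rng):
    """Fill NaN ages for rows whose Title is in `titles` with integers in [low, high]."""
    mask = frame["Age"].isna() & frame["Title"].isin(titles)
    n = int(mask.sum())
    if n:
        # One draw per masked row; endpoint=True makes `high` inclusive.
        frame.loc[mask, "Age"] = rng.integers(low, high, size=n, endpoint=True)
    return n  # number of rows filled, handy for checking that the mask matched


n_young = fill_missing_ages(df, title_young, 0, 18, rng)
n_old = fill_missing_ages(df, title_old, 19, 80, rng)
print(n_young, n_old)
print(df)
```

If a call returns 0 while NaN values remain, the mask matched nothing, which usually points to a mismatch between the title list and the values actually stored in the column.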
ALollz
  • Thanks @ALollz - that helped, and I've gotten rid of some of the errors I had, but the code still does not do what was intended. NaN values still exist in the Age column. Any other ideas? – alofgran Mar 21 '18 at 02:30
  • Are the `NaN` values being read in as actual null values recognized by Python, or just as the string `'NaN'`? Check the types with `df.dtypes`. Otherwise, perhaps there's extra whitespace in the Title values, so that they are actually ' Mr.', which will not match 'Mr.'. You can see what they actually are by selecting individual elements by index and column, e.g. `df.loc[1,'Age']`. – ALollz Mar 21 '18 at 02:50
  • Thanks ALollz! That did it. While I did have the sense to check whether they were strings or actually null values before I came to the forum, it turned out that one of my lists did not have the spaces before the titles as you suggested. Changed that one thing and it fixed it all. Thanks! – alofgran Mar 21 '18 at 03:25
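
For anyone hitting the same symptom, the two checks from the comments can be sketched as follows. The data and the `titles` list here are made up to reproduce the leading-space mismatch. Stripping both sides before comparing is one possible fix, assumed here as an alternative to editing the list by hand as was done in the thread.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the stored titles carry a leading space, as in the thread.
df = pd.DataFrame({
    "Title": [" Mr.", " Mrs.", " Master."],
    "Age": [np.nan, 35.0, np.nan],
})
titles = ["Mr.", "Master."]  # no leading spaces, so isin() matches nothing

# Check 1: is Age a numeric column with real NaN, or an object column of strings?
age_is_numeric = pd.api.types.is_numeric_dtype(df["Age"])

# Check 2: list the raw values; printing the list shows quotes, so stray spaces are visible.
unique_titles = df["Title"].unique().tolist()
raw_matches = int(df["Title"].isin(titles).sum())

# Possible fix: compare stripped copies so whitespace on either side no longer matters.
clean_titles = [t.strip() for t in titles]
stripped_matches = int(df["Title"].str.strip().isin(clean_titles).sum())

print(age_is_numeric, unique_titles)
print(raw_matches, stripped_matches)
```

With this toy data the raw comparison matches no rows, while the stripped comparison matches the two intended rows.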