
I am encoding categorical variables in my dataframe. I found a nice Pythonic way to do this with lambda expressions. For instance, the following line of code replaces the gender categories "male" and "female" (encoded as strings) with the values 1 and 0, respectively.

train_frame['Sex'] = train_frame['Sex'].apply(lambda x : 1 if x =='male' else 0)

Now my question is: can I do the same for more than two categories (that is, with more than one `if` in the expression)?

I am trying to do this for the port where passengers embarked, which I want to represent as an integer (some background info: S = Southampton, C = Cherbourg, Q = Queenstown).

I tried to do something like this, but it does not work:

#Southampton = 0, Cherbourg = 1, Queenstown = 2
train_frame['Embarked'] = train_frame['Embarked'].apply(lambda x: 0 if x =='S', 1 if x=='C' else 2 )

Can somebody explain whether it is possible to use lambda expressions with multiple if-statements? And, slightly off-topic: is there a more Pythonic way to encode categoricals in a dataframe?

Psychotechnopath
  • `lambda` uses an expression, not statements. [Conditional expressions](https://docs.python.org/3/reference/expressions.html#conditional-expressions) can be chained, though. – Yann Vernier Dec 17 '19 at 10:52
  • I do not agree with the duplicate, as I am specifically asking about multiple if-statements in lambda expressions, and the question mine is supposedly a duplicate of is totally different (it is about adding columns to a dataframe, something I am not doing). – Psychotechnopath Dec 17 '19 at 11:06
  • @Psychotechnopath nromashchenko already answered that aspect https://stackoverflow.com/a/59372653/1011724 so is it still necessary to reopen? – Dan Dec 17 '19 at 11:10
  • @Dan As the question is not a duplicate IMO, yes, it should be re-opened. – Psychotechnopath Dec 17 '19 at 11:14
  • If the question is really about chaining multiple if statements, then you shouldn't have included all the extra parts about categorical variables or mentioned pandas. The fact that you did makes it seem like the real question is how to encode categoricals in pandas. Either way, it's not really a duplicate since you should be doing OHE anyway. – Dan Dec 17 '19 at 11:17
  • Yes but as I'm not that experienced I decided to include more context as I wasn't sure if the problem was using the apply function or the chaining of the multiple if-statements in the expression. Anyways, my question is answered and I learned some new stuff so thank you very much =). – Psychotechnopath Dec 17 '19 at 11:24
  • @Psychotechnopath - reopened, I have no problem with it. – jezrael Dec 17 '19 at 11:42
  • @jezrael tyvm =) – Psychotechnopath Dec 17 '19 at 11:45

3 Answers


One approach is to use a dict with `map`, filling anything that is not in the dict with a default value:

data = {'S': 0, 'C': 1}
train_frame['Embarked_N'] = train_frame['Embarked'].map(data).fillna(2)
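
To make the behaviour of this approach concrete, here is a minimal sketch. The small `train_frame` below is a hypothetical stand-in for the question's data, and the final `astype(int)` is an addition not present in the answer. It shows that `map` yields NaN for any value missing from the dict, so `fillna(2)` sends both "Q" and genuinely missing entries to 2.

```python
import pandas as pd

# Hypothetical stand-in for the question's train_frame
train_frame = pd.DataFrame({"Embarked": ["S", "C", "Q", "S", None]})

codes = {"S": 0, "C": 1}

# map() returns NaN for anything not in the dict, including missing values
mapped = train_frame["Embarked"].map(codes)

# fillna(2) sends every unmapped value (here "Q" and the missing entry) to 2;
# the intermediate NaN makes the column float, so cast back to int
train_frame["Embarked_N"] = mapped.fillna(2).astype(int)

print(train_frame)
```

If missing ports should stay missing rather than become 2, listing all three categories in the dict and skipping `fillna` is probably the safer choice.
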
Rakesh

Using `.apply` is generally not a good idea if you can avoid it. In this case, I would suggest using `pd.get_dummies` or scikit-learn's transformers instead, as you probably want to encode these into multiple columns. Alternatively, you can use `replace`:

# Southampton = 0, Cherbourg = 1, Queenstown = 2
train_frame['Embarked'] = train_frame['Embarked'].replace({
    "S": 0,
    "C": 1,
    "Q": 2,
})

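You could use a `defaultdict` if you really want that `else` behaviour, but I recommend going with `get_dummies` or sklearn instead. (Note that if you want to keep everything in one column for some reason, sklearn has `OrdinalEncoder`; `LabelBinarizer` produces one column per class when there are more than two categories.)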
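
As a rough illustration of the multi-column encoding suggested here, the sketch below applies `pd.get_dummies` to a hypothetical four-row `train_frame`. The `prefix` argument and the `join` step are choices made for the example, not something the answer prescribes.

```python
import pandas as pd

# Hypothetical stand-in for the question's train_frame
train_frame = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per category: Embarked_C, Embarked_Q, Embarked_S
dummies = pd.get_dummies(train_frame["Embarked"], prefix="Embarked")

# Attach the indicator columns next to the original column
encoded = train_frame.join(dummies)

print(encoded)
```

Each row has exactly one indicator set, so no ordering such as 0 < 1 < 2 is imposed on the ports.
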

Dan
  • A `defaultdict` will create new entries when encountering unknown keys. The default argument of [dict.get](https://docs.python.org/3/library/stdtypes.html#dict.get) might be better. – Yann Vernier Dec 17 '19 at 11:00
  • @YannVernier how would you set that default in this case? – Dan Dec 17 '19 at 11:01
  • I'm not sure. I haven't really used pandas; perhaps its paired list of `to_replace` and `value` combined with a match-anything regex? Making a defaultdict-like subclass with `get`-like behaviour instead of `setdefault`-like is also an option. – Yann Vernier Dec 17 '19 at 11:07
  • Can you explain why using .apply isn't a good idea? And why would I want to encode them into multiple columns? I simply want to use the categorical feature to train my SVM on. – Psychotechnopath Dec 17 '19 at 11:10
  • Regarding `apply` - it's not efficient, [see here](https://stackoverflow.com/a/55557758/1011724). Regarding the treatment of categoricals, read up on [one hot encoding](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f), or categorical encoding in general. It's too long a topic for a comment, but in short your method is imposing an artificial ordering of your categories i.e. 0 < 1 < 2. – Dan Dec 17 '19 at 11:13

Try chaining the conditional expressions, with each `else` leading into the next `if`:

   lambda x: 0 if x== 'S' else 1 if x == 'C' else 2
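
A short check of how the chained form is read may help. The sketch below uses plain Python with the port codes from the question; the names `encode` and `explicit` are illustrative. The chain groups from the right, so the second conditional is simply the `else` branch of the first.

```python
# Chained conditional expression, as in the line above
encode = lambda x: 0 if x == "S" else 1 if x == "C" else 2

# The same expression with the implicit grouping written out
explicit = lambda x: 0 if x == "S" else (1 if x == "C" else 2)

ports = ["S", "C", "Q"]
results = [encode(p) for p in ports]

print(results)
```

The same lambda can be passed to `.apply` on the `Embarked` column exactly as in the question's `Sex` example.
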
nromashchenko