Ifelse in R with multiple categorical conditions

Question

I have a data set, dt.train2, with 1500 observations and 130 variables. One of them is language, and it can be english, french, arabic...

I want to create an ifelse statement that assigns 1 for english, 2 for french, 3 for spanish and 0 for anything else. I have no idea how to do it.

dt.train2[, language_string := ifelse(language == "english", 
                                      1, 
                                      ifelse(language == "french", 
                                             2, 
                                             ifelse(language == "spanish",3)]

I'm using this to fit a linear model of sales.

You're just missing the final else condition, `language == "spanish",3, 0)`, you need that **`, 0`**. — Gregor Thomas, Nov 17 '17 at 21:04
That said, this seems like a terrible way to prepare for a linear model, unless you really think that whatever language effect there is, the French effect is two times the English effect, and the Spanish effect three times the English effect. — Gregor Thomas, Nov 17 '17 at 21:07
I'd say make a table and do an update join https://stackoverflow.com/questions/42587214/substitute-dt1-x-with-dt2-y-when-dt1-x-and-dt2-x-match-in-r — Frank, Nov 17 '17 at 21:10

Gregor Thomas · Answer 1 · 2017-11-17T21:34:04.627

You're almost there with the ifelse(), you just need the final else result (and a couple missing closing parentheses).

dt.train2[, language_string := ifelse(
  language == "english", 1,
    ifelse(language == "french", 2,
      ifelse(language == "spanish", 3, 0)
    )
  )
]

A couple other ways you could do this:

Make a lookup table and join:

library(data.table)

# sample data
dt = data.table(language = c("english", "french", "spanish", "arabic", "chinese", "pig latin"))


lookup = data.table(language = c("english", "french", "spanish"),
                    language_string = c(1, 2, 3))

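# left join: keeps every row of dt, unmatched languages get NA
# (note that merge() sorts the result by language)
dt2 = merge(dt, lookup, by = "language", all.x = TRUE)
# replace the NAs with the default, 0
dt2[is.na(language_string), language_string := 0]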

The above lookup table method is probably the nicest for scalability. However, for such a small number of encodings, you could also just set each of them:

# start with the default, 0
dt[, language_string := 0 ]
# then do each of the exceptions
dt[language == "english", language_string := 1]
dt[language == "french", language_string := 2]
dt[language == "spanish", language_string := 3]

Fyi, one idiom for a lookup is like `dt[, v := 0][lookup, on=.(language), v := language_string]` so that the column is added by reference instead of building a new table. — Frank, Nov 17 '17 at 21:42
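A minimal sketch of the update-join idiom from the comment above, assuming the same toy data as in the answer. The column name `v` comes from the comment; the `i.` prefix is added here only to make explicit that the value is taken from `lookup`.

```r
library(data.table)

# Toy data, same shape as the sample in the answer above.
dt <- data.table(language = c("english", "french", "spanish", "arabic", "chinese", "pig latin"))
lookup <- data.table(language = c("english", "french", "spanish"),
                     language_string = c(1, 2, 3))

# Step 1: add the column by reference with the default value, 0.
# Step 2: join on language and overwrite only the rows that match the lookup.
#         The i. prefix says "take language_string from lookup".
dt[, v := 0][lookup, on = .(language), v := i.language_string]
```

Unlike `merge()`, this keeps the original row order of `dt` and does not build a second table.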

Beau · Answer 2 · 2017-11-21T01:00:48.663

I agree with the comments that if this is for a linear model, it is a bad way to go about it. If you insist on languages as predictors, it is better practice to make a series of dummy variables. In this case, you would add 3 predictors ("english", "french", "spanish"), each of which takes a value of 0 or 1.
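To make the dummy-variable point concrete, here is a minimal base R sketch. The toy vector and the names `lang_f` and `dummies` are made up for illustration. The idea is that a factor lets `model.matrix()` (and therefore `lm()`) build one 0/1 column per language instead of assuming a 1-2-3 scale.

```r
# Hypothetical toy data; the values mirror the languages in the question.
language <- c("english", "french", "spanish", "arabic", "english")

# Collapse everything outside the three languages of interest into "other",
# and make "other" the reference level.
lang_f <- factor(
  ifelse(language %in% c("english", "french", "spanish"), language, "other"),
  levels = c("other", "english", "french", "spanish")
)

# model.matrix() shows the 0/1 columns that lm() would build from the factor:
# an intercept plus one indicator column per non-reference level.
dummies <- model.matrix(~ lang_f)
dummies
```

With a factor in the formula, something like `lm(sales ~ lang_f)` would estimate a separate effect for each language; `sales` here stands in for whatever the response column is actually called.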

Either way, here is my dplyr take on the problem, which utilizes a CASE WHEN SQL-style syntax and is easier to read.

library(dplyr)  # provides mutate() and case_when()
dt.train2 <- dt.train2 %>%
          mutate(language_string = case_when(language == "english" ~ 1,
                                             language == "french" ~ 2,
                                             language == "spanish" ~ 3,
                                             TRUE ~ 0))

edited Nov 21 '17 at 01:00

answered Nov 17 '17 at 21:24


You're right. I'm really new to this. I'll try your solution. Thanks a lot! – João Barbosa Nov 19 '17 at 22:03
@Gregor Thanks for catching that. I edited it with the catch-all of `TRUE ~ 0` for other cases. – Beau Nov 21 '17 at 01:01

Sebastian Rothbucher · Answer 3 · 2017-11-17T21:42:06.517

Check this:

 t <- 'es'
 tt <- (if (t == 'en') 1 else if (t == 'fr') {2} else if (t == 'es') 3 else 0)
 tt
 [1] 3

You can then apply it to a data frame:

df <- data.frame(l=c('en', 'fr', 'es', 'sthelse'))
df$lnum <- sapply(df$l, function(t){(if (t == 'en') 1 else if (t == 'fr') {2} else if (t == 'es') 3 else 0)})
df
        l lnum
1      en    1
2      fr    2
3      es    3
4 sthelse    0

edited Nov 17 '17 at 21:42

answered Nov 17 '17 at 21:11


Could alternately write this algebraically, `(t=="en") + 2*(t=="fr") + 3*(t=="es")`, so that it's vectorized. – Frank Nov 17 '17 at 21:13
Debated downvoting (didn't), but the un-vectorized way is really terrible, even on a few hundred rows it will be noticeably slow. And the `unlist(lapply())` is overly complicated - just use `sapply` if you're going to do that. But vectorized `ifelse()` (or Frank's clever vectorized solution) are much better. – Gregor Thomas Nov 17 '17 at 21:25
you're quite right indeed - changed it to sapply. Thanks! – Sebastian Rothbucher Nov 17 '17 at 21:42
Or just `c(en=1, fr=2, es=3)[t]` – Frank Harrell Nov 24 '17 at 12:42
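The two vectorized suggestions from the comments can be sketched side by side in base R. The vector `l` and the result names are illustrative only. One caveat with the named-vector lookup: values that are not in the vector come back as `NA`, so they need to be set to 0 afterwards.

```r
l <- c("en", "fr", "es", "sthelse")

# Algebraic version: each comparison yields TRUE/FALSE,
# which arithmetic coerces to 1/0.
lnum_algebraic <- (l == "en") + 2 * (l == "fr") + 3 * (l == "es")

# Named-vector lookup: index the codes by name, then replace
# the NA from unmatched values with the default, 0.
codes <- c(en = 1, fr = 2, es = 3)
lnum_lookup <- unname(codes[l])
lnum_lookup[is.na(lnum_lookup)] <- 0
```

If the column is a factor rather than character, wrap it in `as.character()` before indexing, otherwise the lookup uses the factor's integer codes instead of its labels.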

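Nathan Werth · Answer 4 · answered Nov 17 '17 at 22:04

# `...` stands for any further languages that should get their own code;
# drop it if only these three are coded
language_options <- c("english", "french", "spanish", ...)
# match() returns each language's position in language_options;
# nomatch = 0L gives 0 (instead of NA) for anything not listed
dt.train2[, language_string := match(language, language_options, nomatch = 0L)]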
