0

I have a data set, dt.train2, with 1500 observations and 130 variables. One of them is language, and it can be english, french, arabic...

I want to create an ifelse statement that assigns 1 for english, 2 for french, 3 for spanish and 0 for anything else. I have no idea how to do it.

dt.train2[, language_string := ifelse(language == "english", 
                                      1, 
                                      ifelse(language == "french", 
                                             2, 
                                             ifelse(language == "spanish",3)]

I'm using this to run a linear model of sales.

Michael
João Barbosa
  • You're just missing the final else condition, `language == "spanish",3, 0)`, you need that **`, 0`**. – Gregor Thomas Nov 17 '17 at 21:04
  • 1
    That said, this seems like a terrible way to prepare for a linear model, unless you really think that whatever language effect there is, the French effect is two times the English effect, and the Spanish effect three times the English effect. – Gregor Thomas Nov 17 '17 at 21:07
  • 1
    I'd say make a table and do an update join https://stackoverflow.com/questions/42587214/substitute-dt1-x-with-dt2-y-when-dt1-x-and-dt2-x-match-in-r – Frank Nov 17 '17 at 21:10
  • You're completely right! Thanks! – João Barbosa Nov 19 '17 at 22:00

4 Answers

3

You're almost there with the ifelse(); you just need the final else result (and a couple of missing closing parentheses).

dt.train2[, language_string := ifelse(
  language == "english", 1,
    ifelse(language == "french", 2,
      ifelse(language == "spanish", 3, 0)
    )
  )
]

A couple other ways you could do this:

Make a lookup table and join:

# sample data
dt = data.table(language = c("english", "french", "spanish", "arabic", "chinese", "pig latin"))


lookup = data.table(language = c("english", "french", "spanish"),
                    language_string = c(1, 2, 3))

dt2 = merge(dt, lookup, by = "language", all.x = TRUE)
dt2[is.na(language_string), language_string := 0]

The above lookup table method is probably the nicest for scalability. However, for such a small number of encodings, you could also just set each of them:

# start with the default, 0
dt[, language_string := 0 ]
# then do each of the exceptions
dt[language == "english", language_string := 1]
dt[language == "french", language_string := 2]
dt[language == "spanish", language_string := 3]
Gregor Thomas
  • 2
    Fyi, one idiom for a lookup is like `dt[, v := 0][lookup, on=.(language), v := language_string]` so that the column is added by reference instead of building a new table. – Frank Nov 17 '17 at 21:42
  • Perfect, that's what I was looking for! Thank you very much – João Barbosa Nov 19 '17 at 22:02
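The update-join idiom mentioned in the comment above can be sketched as follows. This is a minimal example, assuming the data.table package is installed; the sample rows are hypothetical, and the `i.` prefix is used only to make explicit that the value comes from the lookup table.

```r
library(data.table)

# Hypothetical sample data, for illustration only
dt <- data.table(language = c("english", "french", "spanish", "arabic", "chinese"))
lookup <- data.table(language = c("english", "french", "spanish"),
                     language_string = c(1, 2, 3))

# Start every row at the default, 0
dt[, language_string := 0]

# Update join: for rows of dt that match lookup on `language`,
# overwrite language_string with lookup's value (i. refers to lookup's column)
dt[lookup, on = .(language), language_string := i.language_string]

print(dt)
```

Because the column is added by reference, the row order of dt is left untouched, whereas merge() returns a new table sorted by the join column.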
1

I agree with the comments that if this is for a linear model, it is a bad way to go about it. If you insist on languages as predictors, it is better practice to make a series of dummy variables. In this case, you would add 3 predictors ("english", "french", "spanish") which all take values of 0 or 1.
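As a rough sketch of the dummy-variable idea in base R: the data frame below, including its `sales` values, is made up for illustration, and `language_group` is a hypothetical column name. Converting the column to a factor lets R build the 0/1 columns itself, and model.matrix() shows what they look like.

```r
# Hypothetical toy data; the sales values are invented for illustration only
df <- data.frame(
  language = c("english", "french", "spanish", "arabic", "english", "french"),
  sales    = c(10, 12, 9, 7, 11, 13),
  stringsAsFactors = FALSE
)

# Collapse every language outside the three of interest into "other"
keep <- c("english", "french", "spanish")
df$language_group <- ifelse(df$language %in% keep, df$language, "other")

# Make "other" the reference level so each dummy is a contrast against it
df$language_group <- factor(df$language_group, levels = c("other", keep))

# The 0/1 indicator columns that lm() would construct from the factor
dummies <- model.matrix(~ language_group, data = df)
print(dummies)
```

In practice you would not need to build the matrix by hand: a call such as lm(sales ~ language_group, data = df) creates the same indicator columns internally.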

Either way, here is my dplyr take on the problem, which utilizes a CASE WHEN SQL-style syntax and is easier to read.

library(tidyverse)  # loads dplyr, which provides mutate() and case_when()
dt.train2 <- dt.train2 %>%
          mutate(language_string = case_when(language == "english" ~ 1,
                                             language == "french" ~ 2,
                                             language == "spanish" ~ 3,
                                             TRUE ~ 0))
Beau
0

check this:

 t <- 'es'
 tt <- (if (t == 'en') 1 else if (t == 'fr') {2} else if (t == 'es') 3 else 0)
 tt
 [1] 3

You can then apply it to a data frame:

df <- data.frame(l=c('en', 'fr', 'es', 'sthelse'))
df$lnum <- sapply(df$l, function(t){(if (t == 'en') 1 else if (t == 'fr') {2} else if (t == 'es') 3 else 0)})
df
        l lnum
1      en    1
2      fr    2
3      es    3
4 sthelse    0
  • Could alternately write this algebraically, `(t=="en") + 2*(t=="fr") + 3*(t=="es")`, so that it's vectorized. – Frank Nov 17 '17 at 21:13
  • Debated downvoting (didn't), but the un-vectorized way is really terrible, even on a few hundred rows it will be noticeably slow. And the `unlist(lapply())` is overly complicated - just use `sapply` if you're going to do that. But vectorized `ifelse()` (or Frank's clever vectorized solution) are much better. – Gregor Thomas Nov 17 '17 at 21:25
  • you're quite right indeed - changed it to sapply. Thanks! – Sebastian Rothbucher Nov 17 '17 at 21:42
  • Or just `c(en=1, fr=2, es=3)[t]` – Frank Harrell Nov 24 '17 at 12:42
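The two vectorized alternatives suggested in the comments above can be sketched in base R like this. The variable names are hypothetical. Note that the named-vector lookup returns NA for unlisted values, so an extra step is needed to get 0, and it assumes a character vector (indexing with a factor would use the integer codes instead).

```r
# Hypothetical input; must be character, not factor, for the named lookup
l <- c("en", "fr", "es", "sthelse")

# Algebraic form: each comparison yields TRUE/FALSE, which is treated as 1/0
algebraic <- (l == "en") + 2 * (l == "fr") + 3 * (l == "es")

# Named-vector lookup: index the codes by name, then replace misses (NA) with 0
codes <- c(en = 1, fr = 2, es = 3)
looked_up <- unname(codes[l])
looked_up[is.na(looked_up)] <- 0

print(algebraic)
print(looked_up)
```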
0
language_options <- c("english", "french", "spanish", ...)
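# match() returns each language's position in language_options;
# nomatch = 0L gives 0 (instead of NA) for anything not listed
dt.train2[, language_string := match(language, language_options, nomatch = 0L)]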
Nathan Werth
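A small runnable sketch of the match() approach in base R, with the elided language list replaced by three entries purely for illustration. It assumes you want 0 for unlisted languages, which is why `nomatch = 0L` is passed; without it, match() returns NA for them.

```r
# Illustrative list only; the order determines the code (1, 2, 3)
language_options <- c("english", "french", "spanish")

# Hypothetical input values
language <- c("french", "arabic", "english", "spanish")

# Position of each value in language_options, 0 when it is not listed
codes <- match(language, language_options, nomatch = 0L)
print(codes)
```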