
I ran into a strange problem in R while using data.table:

Here, when I try to recode every Province with a count under 500 as "Other", the output replaces the names of the top-count provinces with index numbers:

df <- fact_data[,.N,Province][N >= 500]$Province
df
fact_data[,Province := ifelse(Province %in% df, fact_data$Province, "Other")]
fact_data[,.N,Province][order(-N)]

Output: p1

But this method works well on factor variables whose values are numeric. For example, if I use BranchNumber instead of Province, where the values look like "1", "3", I get output like this, which is good:

p2

Do you know why this happens and how to resolve it?

Cherry Wu
  • This is probably a side effect of `ifelse`, which has a bad habit of changing the class of its return value unpredictably. Try `fact_data[ Province %in% df, Province := "Other" ]` instead. – Frank Oct 04 '16 at 01:54
  • Awesome! It works. I just needed to change the previous logic to `df <- fact_data[,.N,Province][N < 500]$Province`; then with `fact_data[ Province %in% df, Province := "Other" ]` I got what I wanted. Thank you very much!! – Cherry Wu Oct 04 '16 at 15:41
  • Cool. FYI, you can also negate it like `!( Province %in% df )`, though that might make the code more confusing (compared to changing the inequality). – Frank Oct 04 '16 at 16:04
  • That was a great lesson! Thank you very much!! – Cherry Wu Oct 04 '16 at 17:32

1 Answer


This is probably a side effect of ifelse, which drops the attributes of its yes and no arguments, so a factor comes back as its underlying integer codes rather than its labels. Assign to just the rows you want to change instead:

fact_data[ !( Province %in% df ), Province := "Other" ] 
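To see the ifelse behaviour in isolation, here is a minimal base R sketch. The vector `province` and the set `keep` are made-up stand-ins for the Province column and for `df` in the question, not the asker's data.

```r
# Toy factor standing in for the Province column (values are made up).
province <- factor(c("Ontario", "Quebec", "Ontario", "Yukon"))
keep <- c("Ontario", "Quebec")

# ifelse() builds its result from the logical test and ignores the factor
# attributes of `yes`, so the kept values come back as integer codes.
broken <- ifelse(province %in% keep, province, "Other")
print(broken)  # "1" "2" "1" "Other"

# Converting to character first keeps the labels.
fixed <- ifelse(province %in% keep, as.character(province), "Other")
print(fixed)   # "Ontario" "Quebec" "Ontario" "Other"
```

The codes follow the order of the factor levels, which would explain why the question's output showed index numbers where province names were expected.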

Generally, I would recommend working with character vectors as data.table columns instead of factors whenever possible.
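As a rough sketch of that recommendation, the whole recoding can be done on a character column. The data below is invented, and the threshold of 2 stands in for the 500 used in the question; `keep` and `counts` are illustrative names.

```r
library(data.table)

# Made-up data; the column starts as a factor, as in the question.
fact_data <- data.table(
  Province = factor(c("ON", "ON", "ON", "QC", "QC", "YT"))
)

# Store the column as character so the recoding does not depend on factor levels.
fact_data[, Province := as.character(Province)]

# Provinces that meet the threshold (2 here, 500 in the question).
keep <- fact_data[, .N, by = Province][N >= 2]$Province

# Overwrite only the rows whose province is not in `keep`.
fact_data[!(Province %in% keep), Province := "Other"]

# Count per province, largest first.
counts <- fact_data[, .N, by = Province][order(-N)]
print(counts)
```

If a factor is needed later (for modelling or plotting, say), the column can be converted back with `factor()` once the recoding is done.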

Frank