0

I'm not an expert in R...

In my training data there is a field called Source with 30 levels. I want to keep only the top 2 levels, since they account for the majority of rows, and change the other 28 levels into 'Other'. That will make it easier to apply one-hot encoding later.

I have checked the solutions here: Solution 1 and Solution 2

But I am still stuck...

Here are the major solutions I tried:

train is the original training data and x_train is just a copy. Source is a factor variable. The top 2 levels are 'S122', which is level 1, and 'S133', which is level 8.

Try 1

Here I'm using which() so that I don't need to convert the factor into character first. In fact, before using which(), I did try converting the factor into character; the result was the same, it didn't work. After running the code below, nothing changed except that one more level called 'Other' was added.

x_train <- train
levels(x_train$Source) <- c(levels(x_train$Source), "Other")
x_train$Source[which((x_train$Source != 'S122') && (x_train$Source != 'S133'))] <- 'Other'
str(x_train$Source)

I am not using methods like revalue() in this case because 28 levels need to be changed, and I don't want to write out 28 values in one call.
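A likely reason Try 1 leaves the data unchanged is the `&&` operator. `&&` is meant for single TRUE/FALSE conditions: older R versions looked only at the first element of each side, and recent versions raise an error for longer inputs. The element-wise operator is `&`. The sketch below uses a small made-up factor (the values are hypothetical, not the real train$Source) to show the same idea with `&`:

```r
# Toy stand-in for train$Source (hypothetical data, not the asker's).
src <- factor(c("S122", "S127", "S133", "S140", "S122"))

# Add the new level first, as in Try 1.
levels(src) <- c(levels(src), "Other")

# `&` compares element by element, giving one TRUE/FALSE per row.
src[src != "S122" & src != "S133"] <- "Other"

# Drop the original levels that are now empty.
src <- droplevels(src)

levels(src)
as.character(src)
```

On this toy data the remaining levels are 'S122', 'S133' and 'Other'.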

Try 2

Then I switched to a very simple approach, iteration. I tried a while loop too, and it didn't work either.

x_train <- train
for (i in 1:30) {
  if (i == 1 || i == 8) {
    next
  }
  levels(x_train$Source)[i] <- 'Other'
}

With this method, not all 28 levels are changed. I have realized that as levels are renamed to 'Other', the total number of levels shrinks, so the indices shift. That's why I switched to a while loop, but it still didn't work.

So, is there any way to keep just the top 2 levels and change all the other levels into 'Other'?

divibisan
Cherry Wu
  • Did you resolve this issue? If so, please share, if not, please add a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – shayaa Jul 27 '16 at 15:00
  • Yeah, I solved the problem in a simple way, but it looks silly. I will share it after work today~ – Cherry Wu Jul 27 '16 at 16:02

2 Answers

1

This is not a reproducible example, since you do not provide the data, but assuming that your factor is part of train, you can use:

levels(train$source) <-c("S122", "S133", rep("Other",3))

For example, imagine the titanic data.

titanic <- reshape2::melt(Titanic)
head(titanic)
  Class    Sex   Age Survived value
1   1st   Male Child       No     0
2   2nd   Male Child       No     0
3   3rd   Male Child       No    35
4  Crew   Male Child       No     0
5   1st Female Child       No     0
6   2nd Female Child       No     0

Now, suppose that I wanted to relabel the factor levels so that the two highest classes fall into one group and the remaining classes into another. I do not need any for loops. I just write

levels(titanic$Class) <- c("High", "High", "Low", "Low")

Now when I look at the first rows of the data I get

head(titanic)
   Class    Sex   Age Survived value
1   High   Male Child       No     0
2   High   Male Child       No     0
3    Low   Male Child       No    35
4    Low   Male Child       No     0
5   High Female Child       No     0
6   High Female Child       No     0
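The same level-renaming idea can be written without counting level positions by hand. This is only a sketch on a made-up factor (the values and the `keep` vector are assumptions for illustration): it renames every level that is not in a keep-list, and R merges levels that end up with the same name.

```r
# Hypothetical factor standing in for train$Source.
src <- factor(c("S122", "S125", "S133", "S140", "S150", "S122", "S133"))

# Levels to leave untouched (assumed to be known in advance).
keep <- c("S122", "S133")

# Rename every level not in `keep`; identical level names are merged.
levels(src)[!(levels(src) %in% keep)] <- "Other"

levels(src)
table(src)
```

Because only the levels attribute is touched, the order of the kept levels relative to 'Other' follows the original level order.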
shayaa
  • This is a good inspiration. I will choose this as the solution, but one thing needs correcting: it should be `levels(train$source) <-c("S122", rep("Other",6), "S133", rep("Other",22))`. In my case, "S133" is level 8 but has the second-highest count, and "S122" is level 1 and has the highest count, so I need to change the levels between "S122" and "S133" into "Other", and the levels after "S133" into "Other" too. – Cherry Wu Jul 30 '16 at 06:43
  • Yes, that looks right. Next time, you could help people who want to help you by sharing a [minimal dataset which resembles your problem](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). That way, I would know the order of the levels of your factors. – shayaa Jul 30 '16 at 06:46
0

I finally solved this problem, but the solution is not very elegant. So if there is a better solution, feel free to post it here.

Let's recall the major part I mentioned in the question:

x_train is the copy of train.

x_train$Source has 30 levels; level 1 is 'S122' and level 8 is 'S133'. I just want to keep these 2 levels and set the other 28 levels to 'Other'.

If this description is still not clear, here's an example:

Original x_train$Source, 30 levels

S122, S123, S124.., S133, S134,....

Final x_train$Source levels

S122, Other, S133

My question was how to get the final result, i.e. how to turn 30 levels into 3 levels.

Here's the solution:

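x_train <- train
summary(x_train$Source)
# Rename level 2; level 1 ('S122') is left alone.
levels(x_train$Source)[2] <- 'Other'
# Each pass merges the current level 3 into 'Other', so the next level
# moves into position 3. Five passes cover original levels 3 to 7.
for (i in 3:7) {
  levels(x_train$Source)[3] <- 'Other'
}
summary(x_train$Source)
# 'S133' is now level 3; merge the 22 levels after it the same way.
for (j in 1:22) {
  levels(x_train$Source)[4] <- 'Other'
}
summary(x_train$Source)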

As you can see, the code has hard-coded parts, which is not good.

So if there is a better solution, you are very welcome to post it here!
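One way to avoid the hard-coded positions might be to let table() find the two most frequent levels and then rename everything else in one step. This is a sketch on made-up data (the values and the name `top2` are assumptions for illustration, not the real training set):

```r
# Hypothetical data: 'S122' and 'S133' are the two most frequent values.
src <- factor(c("S122", "S122", "S122", "S125", "S133", "S133", "S140"))

# Names of the two most frequent levels, taken from the counts.
top2 <- names(sort(table(src), decreasing = TRUE))[1:2]

# Rename all other levels; R merges levels that share a name.
levels(src)[!(levels(src) %in% top2)] <- "Other"

top2
levels(src)
```

If several levels tie for second place, which one ends up in `top2` depends on how the sort breaks ties, so it is worth checking the counts first.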

Cherry Wu
  • Cherry, please see my edits to this question, clarifying why it would be best to use my solution. – shayaa Jul 29 '16 at 09:41
  • Hi Shayaa, thank you very much for your patience in modifying your solution; now I understand what you mean. I just added a comment under your solution to make it fit my case. Your solution is a good inspiration and it was a good learning experience for me. Thank you very much! – Cherry Wu Jul 30 '16 at 06:45