I'm not an expert in R...
In my training data, there is a field called `Source` which has 30 levels. I just want to keep the top 2 levels, since they are the majority, and change all the other 28 levels into 'Other'. That way, it will be easier for me to apply one-hot encoding later.
I have checked the solutions here: Solution 1 and Solution 2.
But I still got stuck...
Here are the major solutions I tried:
`train` is the original training data, and `x_train` is just a copy. `Source` is a factor variable.
The top 2 levels are 'S122', which is level 1, and 'S133', which is level 8.
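A small made-up stand-in for this setup could look like the sketch below. Only the names `train`, `x_train`, `Source` and the labels 'S122' and 'S133' come from the description above. The other labels, the row count, and the level positions are hypothetical and do not match the real data.

```r
# Hypothetical stand-in for the real training data (8 rows, 5 levels instead of 30).
train <- data.frame(
  Source = factor(c("S122", "S133", "S122", "S101", "S133", "S150", "S122", "S107"))
)
x_train <- train          # working copy, as in the question

str(x_train$Source)       # structure of the factor
table(x_train$Source)     # counts per level; 'S122' and 'S133' are the most frequent here
```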
Try 1
Here I'm using `which`, so that I don't need to convert the factor into character first. In fact, before using `which`, I tried converting the factor into character. The result was the same: it didn't work. After running the code here, nothing changed except that one more level called 'Other' was added.
x_train <- train
levels(x_train$Source) <- c(levels(x_train$Source), "Other")
x_train$Source[which((x_train$Source != 'S122') && (x_train$Source != 'S133'))] <- 'Other'
str(x_train$Source)
Meanwhile, in this case, I am not using methods like `revalue()` because there are 28 levels that need to be changed, and I don't want to write out 28 values in one call.
Try 2
Then I switched to a very simple approach: iteration. I tried a `while` loop too, but it didn't work either.
x_train <- train
for (i in 1:30) {
  if (i == 1 || i == 8) {
    next
  }
  levels(x_train$Source)[i] <- 'Other'
}
After using this method, not all of the 28 levels are changed. I have realized that as those levels are renamed, the total number of levels changes too, so the indices shift. That's why I switched to a `while` loop, but it still didn't work.
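The index shift described above can be reproduced on a tiny example. This is a minimal sketch on a made-up four-level factor; `f` and its labels are hypothetical and are not the real `Source` column.

```r
# Hypothetical toy factor, not the real Source column.
f <- factor(c("a", "b", "c", "d"))

levels(f)[2] <- "Other"  # rename level 2 ("b")
levels(f)                # "a" "Other" "c" "d"

levels(f)[3] <- "Other"  # rename level 3 ("c"); duplicate level names are merged
levels(f)                # "a" "Other" "d"
nlevels(f)               # 3: "d" has moved from index 4 to index 3
```

Because the two 'Other' levels are merged, every later level moves down by one position, so a loop over fixed indices can skip levels.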
Therefore, is there any way for me to just keep the top 2 levels and change all the other levels into 'Other'?