1

I'd like to replace my levels dog1 ... dog4 and cat1 ... cat4 with only two levels, DOG and CAT, but when I use grepl my output contains only NAs.

In my code:

x <- rep(c("dog1","dog2","dog3","dog4","cat1","cat2","cat3","cat4"), 2) # levels
y <- rnorm(16)
d <- data.frame(cbind(x, y))
head(d)

     x                 y
1 dog1 0.906357739138289
2 dog2 0.974674552504268
3 dog3 0.664045049199848
4 dog4 0.911777985232099
5 cat1 0.246575548162824
6 cat2 0.758069789161901


d$x[grepl("dog", d$x)] <- "DOG" 

Warning message:
In `[<-.factor`(`*tmp*`, grepl("dog", d$x), value = c(NA, NA, NA,  :
  invalid factor level, NA generated

d$x[grepl("cat", d$x)] <- "CAT"

Warning message:
In `[<-.factor`(`*tmp*`, grepl("cat", d$x), value = c(NA_integer_,  :
  invalid factor level, NA generated

head(d)

     x                 y
1 <NA> 0.906357739138289
2 <NA> 0.974674552504268
3 <NA> 0.664045049199848
4 <NA> 0.911777985232099
5 <NA> 0.246575548162824
6 <NA> 0.758069789161901

My desired output, if the code ran correctly, is:

head(d)

     x                 y
1 DOG  0.906357739138289
2 DOG  0.974674552504268
3 DOG  0.664045049199848
4 DOG  0.911777985232099
5 CAT  0.246575548162824
6 CAT  0.758069789161901
oguz ismail
Isabel

3 Answers

2

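You could try creating the data frame with stringsAsFactors = FALSE, so that x is stored as a character vector instead of a factor (this is already the default from R 4.0.0 onward):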

# Pass the columns directly: cbind() would coerce the numeric y to character
d <- data.frame(x, y, stringsAsFactors = FALSE)
d$x[grepl("dog", d$x)] <- "DOG"
d$x[grepl("cat", d$x)] <- "CAT" 
Tim Biegeleisen
1

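The key here (as Tim hinted) is understanding that factor variables, while they look similar to character variables on the surface, are stored quite differently: a factor is a vector of integer codes plus a fixed set of levels, so assigning a value that is not already one of the levels produces NA.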
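
To make that difference concrete, here is a small sketch. The vectors f, g and h are made up for illustration and are not part of the question's data; the point is only to show the integer codes behind a factor and why an unknown value turns into NA:

```r
# Hypothetical toy factor, not the question's data
f <- factor(c("dog1", "cat1", "dog2"))

levels(f)      # the fixed set of allowed values
as.integer(f)  # the integer codes that are actually stored

# Assigning a value that is not an existing level yields NA.
# suppressWarnings() only hides the "invalid factor level" warning here.
g <- f
suppressWarnings(g[1] <- "DOG")
is.na(g[1])    # TRUE

# A character vector has no such restriction
h <- as.character(f)
h[1] <- "DOG"
h
```

This is the same mechanism that produces the NAs in the question: "DOG" and "CAT" are not among the existing levels of d$x.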

Here is one way to access and update the levels of your factor:

levels(d$x)
# [1] "cat1" "cat2" "cat3" "cat4" "dog1" "dog2" "dog3" "dog4"

levels(d$x)[grepl("dog", levels(d$x))] <- "DOG"
levels(d$x)[grepl("cat", levels(d$x))] <- "CAT"
head(d)
#     x                   y
# 1 DOG -0.0489713202962167
# 2 DOG  -0.548503649991368
# 3 DOG   0.460493884654479
# 4 DOG   0.143044665735075
# 5 CAT   -2.13008189672678
# 6 CAT  -0.136767747543626

levels(d$x)
# [1] "CAT" "DOG"
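
As a variation on the same idea, levels<- also accepts a named list that maps each new level to the old levels it should absorb. This is only a sketch: it rebuilds the factor as a standalone vector x (an assumption, rather than reusing d$x) so that it runs on its own:

```r
# Standalone copy of the question's values, built as a factor explicitly
x <- factor(rep(c("dog1", "dog2", "dog3", "dog4",
                  "cat1", "cat2", "cat3", "cat4"), 2))

# Each list element is: new level name = old levels to merge into it
levels(x) <- list(
  DOG = paste0("dog", 1:4),
  CAT = paste0("cat", 1:4)
)

levels(x)  # the new levels, in the order given in the list
table(x)   # 8 observations per level
```

Compared with grepl, the list form spells out every old level, which is more verbose but avoids accidental pattern matches.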
s_baldur
0

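Here is yet another version, this time using a regex. We capture everything before the final digit and convert it to upper case with \\U (which requires perl = TRUE). Note that sub() returns a character vector, so d$x is no longer a factor afterwards.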

d$x <- sub("(.*)\\d+", "\\U\\1", d$x, perl = TRUE)
d$x
 #[1] "DOG" "DOG" "DOG" "DOG" "CAT" "CAT" "CAT" "CAT" "DOG" "DOG" "DOG" "DOG" 
 #    "CAT" "CAT" "CAT" "CAT"
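
If the column should stay a factor, one possible sketch is to apply the substitution to the levels rather than to the values. The standalone vector x below is an assumption made so the example runs on its own, and it uses toupper() with a simpler pattern instead of \\U:

```r
# Standalone copy of the question's values, built as a factor explicitly
x <- factor(rep(c("dog1", "dog2", "dog3", "dog4",
                  "cat1", "cat2", "cat3", "cat4"), 2))

# Strip the trailing digits from each level and upper-case the rest;
# levels that become identical are merged automatically
levels(x) <- toupper(sub("\\d+$", "", levels(x)))

is.factor(x)  # still a factor
levels(x)
```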
Ronak Shah