Take the following code to select only alphanumeric strings from a list of strings:

isValid = function(string){
  return(grep("^[A-z0-9]+$", string))
}

strings = c("aaa", "test@test.com", "", "valid")

print(Filter(isValid, strings))

The output is [1] "aaa" "test@test.com".

Why is "valid" missing from the output, and why is "test@test.com" included?

clb

2 Answers

Filter expects a predicate function that returns a logical value (TRUE or FALSE) for each element, but grep returns integer indices. Use grepl instead:

isValid = function(string){
  return(grepl("^[A-z0-9]+$", string))
}

strings = c("aaa", "test@test.com", "", "valid")

print(Filter(isValid, strings))
[1] "aaa"   "valid"
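
A side note on the pattern itself, as a hedged sketch: the range A-z spans more than letters, because in ASCII the characters [ \ ] ^ _ ` sit between Z and a, so "^[A-z0-9]+$" will typically also accept a string such as "a_b". The vector candidates and the names loose and strict below are made up for illustration. Spelling out both letter ranges (or using [[:alnum:]]) is stricter:

```r
# Hypothetical test strings; "a_b" contains an underscore.
candidates <- c("aaa", "a_b", "valid")

# Pattern from the question: the A-z range also covers [ \ ] ^ _ `
loose <- grepl("^[A-z0-9]+$", candidates)

# Explicit letter ranges reject the underscore
strict <- grepl("^[A-Za-z0-9]+$", candidates)

print(loose)
print(strict)  # TRUE FALSE TRUE
```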

Why didn't grep work? It comes down to how R coerces numeric values to logical and how Filter is implemented.

Here's what happened. Called on the whole vector, grep("^[A-z0-9]+$", strings) correctly returns 1 4, the indices of the matching first and fourth elements.
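
To make the difference concrete, here is a small sketch comparing the two functions on the vector from the question. The variable names pattern, idx and flags are only for illustration:

```r
strings <- c("aaa", "test@test.com", "", "valid")
pattern <- "^[A-z0-9]+$"

idx <- grep(pattern, strings)     # integer positions of the matches
flags <- grepl(pattern, strings)  # one TRUE/FALSE per element

print(idx)    # 1 4
print(flags)  # TRUE FALSE FALSE TRUE
```

Only the second form has the same length as the input, which is what a per-element predicate needs.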

But that is not how Filter works. It runs the condition on each element with as.logical(unlist(lapply(x, f))).

So it ran isValid(strings[1]) then isValid(strings[2]) and so on. It created this:

[[1]]
[1] 1

[[2]]
integer(0)

[[3]]
integer(0)

[[4]]
[1] 1

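It then called unlist on that list. unlist drops the empty integer(0) entries, so only 1 1 was left, and as.logical turned that into the logical vector TRUE TRUE. The information about which elements had matched was lost. So in the end you got: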

strings[which(c(TRUE, TRUE))]

which turned into

strings[c(1,2)]
[1] "aaa"           "test@test.com"
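
The steps above can be replayed by hand. This is a sketch that mirrors the as.logical(unlist(lapply(x, f))) expression quoted earlier; the intermediate names per_element, flattened, ind and result are made up for illustration:

```r
strings <- c("aaa", "test@test.com", "", "valid")
isValid <- function(string) grep("^[A-z0-9]+$", string)

per_element <- lapply(strings, isValid)  # list: 1, integer(0), integer(0), 1
flattened <- unlist(per_element)         # empty results vanish: 1 1
ind <- as.logical(flattened)             # TRUE TRUE
result <- strings[which(ind)]            # picks positions 1 and 2

print(lengths(per_element))  # 1 0 0 1
print(result)                # "aaa" "test@test.com"
```

The lengths call shows where things go wrong: two of the four results are empty, so the flattened vector no longer lines up with the input.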

Moral of the story: don't use Filter, or at least give it a function that returns exactly one TRUE or FALSE per element :)
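
If you would rather skip Filter altogether, one possible alternative is to index the vector directly with the logical result of grepl. A minimal sketch, where the name valid is just for illustration:

```r
strings <- c("aaa", "test@test.com", "", "valid")

# grepl is vectorised, so its result can be used as a logical index
valid <- strings[grepl("^[A-z0-9]+$", strings)]

print(valid)  # "aaa" "valid"
```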

Pierre L

You could also go in the opposite direction and exclude any strings that contain punctuation, e.g.:

isValid <- function(string){
  # drop every string that contains a punctuation character
  v1 <- string[!string %in% grep('[[:punct:]]', string, value = TRUE)]
  # drop empty strings as well
  return(v1[v1 != ''])
}
isValid(strings)
#[1] "aaa"   "valid"
Sotos