Sample data

a<-c("hour","four","ruoh", "six", "high", "our")

I want to find all strings that contain o & u & h and are 4 characters long, where the order of the letters does not matter.

I want to return "hour","four","ruoh". This is my attempt:

grepl("o+u+r", a) nchar(a)==4
bvowe
  • What about testing each letter separately? First test (with grep) which elements of the vector contain "o"; for those that pass, test whether they have "u"; and for those that pass, test for "h". – Cris Nov 14 '18 at 23:26
  • @Cris is this the most simple approach to do so? – bvowe Nov 14 '18 at 23:28
  • "four" does not contain o & u & h. – neilfws Nov 14 '18 at 23:28
  • @neilfws I have now done a modification – bvowe Nov 14 '18 at 23:34
  • @bvowe i'm seeing right now that my solution won't work... – Cris Nov 14 '18 at 23:35
  • @bvowe it still says you want to return "four", and that you want strings with o & u & h. I think you mean o & u & r, as in the `grepl`. – neilfws Nov 14 '18 at 23:36
  • See [Regular Expressions: Is there an AND operator?](https://stackoverflow.com/questions/469913/regular-expressions-is-there-an-and-operator); `grepl("(?=.*h)(?=.*o)(?=.*u)", a, perl = TRUE)` – Henrik Nov 14 '18 at 23:43

3 Answers

To match strings of length 4 containing the characters h, o, and u use:

grepl("(?=^.{4}$)(?=.*h)(?=.*o)(?=.*u)",
      c("hour","four","ruoh", "six", "high", "our"),
      perl = TRUE)
[1]  TRUE FALSE  TRUE FALSE FALSE FALSE
  • (?=^.{4}$): string has length 4.
  • (?=.*x): x occurs at any position in string.
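
If the set of required letters changes, the pattern can be assembled instead of typed out. A minimal sketch, assuming the required characters are plain letters with no regex metacharacters; contains_all, chars and n are hypothetical names chosen for illustration:

```r
# Hypothetical helper: one lookahead per required character,
# plus a lookahead that fixes the total length at n.
contains_all <- function(x, chars, n = 4) {
  lookaheads <- paste0("(?=.*", chars, ")", collapse = "")
  pattern <- paste0("(?=^.{", n, "}$)", lookaheads)
  grepl(pattern, x, perl = TRUE)
}

a <- c("hour", "four", "ruoh", "six", "high", "our")
a[contains_all(a, c("h", "o", "u"))]
# [1] "hour" "ruoh"
a[contains_all(a, c("o", "u", "r"))]
# [1] "hour" "four" "ruoh"
```

Characters such as . or * would need escaping before being pasted into the pattern.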
Nairolf

You could use strsplit and setdiff. I added an edge case ("oouh") to your sample data:

a<-c("hour","four","ruoh", "six", "high", "our","oouh")
a[nchar(a) == 4 &
  lengths(lapply(strsplit(a,""),function(x) setdiff(x, c("o","u","h")))) == 1]
# [1] "hour" "ruoh"

Or grepl, which only checks that each letter is present and therefore also keeps "oouh":

a[nchar(a) == 4 & !rowSums(sapply(c("o","u","h"), Negate(grepl), a))]
# [1] "hour" "ruoh" "oouh"

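sapply(c("o","u","h"), Negate(grepl), a) returns a logical matrix with one row per word and one column per letter, TRUE where the word does not contain that letter. rowSums then counts the missing letters for each word, and ! coerces that count to logical, so together they act like a row-wise "none missing" test.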
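
To make the intermediate steps visible, here is the same filter unrolled. This is only an illustrative sketch; is_missing, n_missing and keep are names introduced here, not part of the one-liner above:

```r
a <- c("hour", "four", "ruoh", "six", "high", "our", "oouh")

# Logical matrix: rows are words, columns are the letters o, u, h.
# TRUE means the word does NOT contain that letter.
is_missing <- sapply(c("o", "u", "h"), Negate(grepl), a)
dim(is_missing)
# [1] 7 3

# Number of required letters each word lacks.
n_missing <- rowSums(is_missing)

# Keep 4-character words that lack none of the letters.
keep <- nchar(a) == 4 & n_missing == 0
a[keep]
# [1] "hour" "ruoh" "oouh"
```

Comparing against 0 explicitly does the same job as the ! coercion, it is just spelled out.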

moodymudskipper

Using grepl with your edited method (r instead of h):

a<-c("hour","four","ruoh", "six", "high", "our")

a[grepl(pattern="o", x=a) & grepl(pattern="u", x=a) & grepl(pattern="r", x=a) & nchar(a)==4]

Returns:

[1] "hour" "four" "ruoh"
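
If the list of letters grows, chaining grepl calls by hand gets long. One possible way to fold them, sketched with Reduce; chars, hits and has_all are names introduced here for illustration:

```r
a <- c("hour", "four", "ruoh", "six", "high", "our")
chars <- c("o", "u", "r")

# One logical vector per letter; fixed = TRUE matches each letter literally.
hits <- lapply(chars, grepl, x = a, fixed = TRUE)

# Combine the per-letter vectors element-wise with &.
has_all <- Reduce(`&`, hits)

a[has_all & nchar(a) == 4]
# [1] "hour" "four" "ruoh"
```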
Cris