I'm trying to create a grepl regex in R to match strings that:

  1. Contain 1 or more lowercase letters
  2. Contain 1 or more numbers
  3. Only allow lowercase letters (a-z) or numbers, i.e. no spaces, no special characters, no other punctuation
  4. The string must be exactly 8 characters long

However, my attempt so far doesn't yield any luck:

grepl("((?=.*[[:lower:]])(?=.*[[:digit:]])[[:alpha:]]{8})", x, perl=TRUE)

Any ideas where I'm going wrong?

Examples of inclusion cases would be: xxxxxxx8, 1234567x, ab12ef78

Examples of exclusion cases would be: x!3d5f78, x23456789, Ab123456

SimonSchus

2 Answers

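You're very close; you have the key concepts right (mainly positive lookahead). The problem is that [[:alpha:]]{8} only allows letters, so the digit you require can never be among the eight matched characters. You could use this: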

grepl("((?=.*[[:lower:]])(?=.*[[:digit:]])[[:lower:][:digit:]]{8})", x, perl=TRUE)
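
To see why the original pattern fails, it may help to run it on the question's own examples. This is only an illustrative sketch; the extra string abcdefgh1 is a made-up input, not one from the question.

```r
# The six example strings from the question
x <- c("xxxxxxx8", "1234567x", "ab12ef78", "x!3d5f78", "x23456789", "Ab123456")

# Original attempt: [[:alpha:]]{8} consumes eight letters in a row,
# so none of the eight matched characters can be a digit
original <- "((?=.*[[:lower:]])(?=.*[[:digit:]])[[:alpha:]]{8})"
original_result <- grepl(original, x, perl = TRUE)
print(original_result)
# [1] FALSE FALSE FALSE FALSE FALSE FALSE

# Hypothetical input: eight letters followed by a digit does match,
# which is not what the rules ask for
extra_result <- grepl(original, "abcdefgh1", perl = TRUE)
print(extra_result)
# [1] TRUE
```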

Personally, I don't find it much more readable to use named character classes, so I'd write it like this:

grepl("^(?=.*[a-z])(?=.*\\d)[a-z\\d]{8}$", x, perl=TRUE)

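I also removed the outer parentheses (they aren't necessary) and anchored the beginning and end with ^ and $. Without the anchors, a longer string such as x23456789 would still match, because it contains a valid 8-character run.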
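
As a small sketch of what the anchors change, the same pattern can be tried with and without ^ and $ on the 9-character exclusion case from the question. The variable names here are just for illustration.

```r
# Same pattern, with and without the ^ ... $ anchors
unanchored <- "(?=.*[a-z])(?=.*\\d)[a-z\\d]{8}"
anchored   <- "^(?=.*[a-z])(?=.*\\d)[a-z\\d]{8}$"

too_long <- "x23456789"  # 9 characters, should be rejected

# Unanchored: the first 8 characters "x2345678" satisfy the pattern
unanchored_result <- grepl(unanchored, too_long, perl = TRUE)
print(unanchored_result)
# [1] TRUE

# Anchored: the 8 characters must span the whole string
anchored_result <- grepl(anchored, too_long, perl = TRUE)
print(anchored_result)
# [1] FALSE
```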

Here are the results on your example inputs:

x <- c("xxxxxxx8", "1234567x", "ab12ef78", "x!3d5f78", "x23456789", "Ab123456")

grepl("^(?=.*[a-z])(?=.*\\d)[a-z\\d]{8}$", x, perl=TRUE)
# [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
Ken Williams

You could also manage with very simple regexes by breaking the test up into separate checks:

grepl("[a-z]", x) & # Contain 1 or more lowercase letters
  grepl("\\d", x) & # Contain 1 or more numbers
  !grepl("[A-Z]|\\s|\\p{P}|\\p{S}", x, perl = TRUE) & # no upper, space, punctuation nor special char.
  nchar(x) == 8L # is 8 characters

# [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
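
A possible variation on the split-up test, sketched here as an assumption rather than as part of the answer above: since rule 3 in the question is a whitelist, the third check can be written as "no character outside a-z and 0-9". That also rejects characters that are not uppercase, whitespace, punctuation, or symbols, such as accented letters. The intermediate variable names are made up for readability.

```r
x <- c("xxxxxxx8", "1234567x", "ab12ef78", "x!3d5f78", "x23456789", "Ab123456")

has_lower    <- grepl("[a-z]", x)                      # rule 1: a lowercase letter
has_digit    <- grepl("[0-9]", x)                      # rule 2: a digit
only_allowed <- !grepl("[^a-z0-9]", x, perl = TRUE)    # rule 3: nothing outside a-z, 0-9
right_length <- nchar(x) == 8L                         # rule 4: exactly 8 characters

# Combine the four checks; any of them can be dropped independently
result <- has_lower & has_digit & only_allowed & right_length
print(result)
# [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
```
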
s_baldur
    That's true, and it would allow turning the individual criteria on and off separately. It'll be slower than one big regex, though, because the regex engine can make matching extremely efficient. And it's also possible to build up regexes from chunks and paste them together, so that's an alternative halfway between single-regex and many-regex. – Ken Williams Aug 14 '18 at 15:15
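
The halfway approach mentioned in the comment, building one regex from named chunks, might look roughly like this. The chunk names are hypothetical; the assembled pattern is the same one used in the first answer.

```r
x <- c("xxxxxxx8", "1234567x", "ab12ef78", "x!3d5f78", "x23456789", "Ab123456")

# Each criterion lives in its own named chunk
needs_lower <- "(?=.*[a-z])"   # lookahead: at least one lowercase letter
needs_digit <- "(?=.*\\d)"     # lookahead: at least one digit
body        <- "[a-z\\d]{8}"   # exactly 8 lowercase letters or digits

# Paste the chunks together between the anchors
pattern <- paste0("^", needs_lower, needs_digit, body, "$")

chunk_result <- grepl(pattern, x, perl = TRUE)
print(chunk_result)
# [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
```

Dropping one of the lookahead chunks from the paste0() call switches that criterion off while still running a single regex.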