2

Every time I need to use a regex I realize I've forgotten everything about them.

I am trying to match all words that contain only lowercase alphanumeric characters AND never have the same character twice in a row AND are between 10 and 12 characters long.

Now, to figure out if a character is followed by the same character, I would do (.)\1. To see if a word is between 10 and 12 characters long I do {10,12}. To grab only lowercase letters and digits, I do [0-9a-z].
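To make the three pieces concrete, here is a rough Python sketch that tries each one on its own. The names `check_pieces`, `has_double`, `allowed_only` and `right_length` are made up for illustration, and `.{10,12}` is only an assumed stand-in for "something repeated 10 to 12 times".

```python
import re

# Each building block from the paragraph above, compiled separately.
has_double = re.compile(r"(.)\1")         # a character followed by itself
allowed_only = re.compile(r"[0-9a-z]+")   # lowercase letters and digits only
right_length = re.compile(r".{10,12}")    # 10 to 12 characters in total

def check_pieces(word):
    """Report what each individual piece says about `word` (hypothetical helper)."""
    return {
        "has_double": has_double.search(word) is not None,
        "allowed_only": allowed_only.fullmatch(word) is not None,
        "right_length": right_length.fullmatch(word) is not None,
    }

print(check_pieces("abcdefghij"))
print(check_pieces("aabcdefghij"))
```

A word I want is one where `has_double` is false and the other two are true; what I can't figure out is how to express that as one pattern.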

But how do I link them together?

Cheers!

PS: this will be running on a fairly large NLP xml (100mb+), so I would appreciate it if the regex wasn't the slowest alternative.

Spectraljump

3 Answers

3

I think this will do what you want:

/\b(?:([a-z0-9])(?!\1)){10,12}\b/

Explanation:

\b   // Word boundary
(?:
    ([a-z0-9])  // Match lowercase letters or digit
    (?!\1)      // Not followed by the same character that was just matched
){10,12}        // 10 to 12 times.
\b   // Word boundary
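
If you want to try it out, here is a small Python sketch using the same pattern with the `re` module. The helper name `find_words` and the sample text are invented for the example; only the pattern itself comes from above.

```python
import re

# The pattern from this answer, written as a Python raw string.
PATTERN = re.compile(r"\b(?:([a-z0-9])(?!\1)){10,12}\b")

def find_words(text):
    """Return every substring of `text` matched by PATTERN (hypothetical helper)."""
    return [m.group(0) for m in PATTERN.finditer(text)]

sample = "abcdefghij abcdefghii abcdefghijklm ab1cd2ef3gh4"
print(find_words(sample))  # ['abcdefghij', 'ab1cd2ef3gh4']
```

In that sample, `abcdefghii` is rejected for the doubled `i` and `abcdefghijklm` for being 13 characters long.
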
Rohit Jain
2

Here's one, although I'm not sure there won't be a better way...

/\b(?:([a-z0-9])(?!\1)){10,12}\b/
Andrew Cheong
1

Here is my attempt:

 (\b(?![0-9a-z]*([0-9a-z])\2)[0-9a-z]{10,12}\b)

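(We have to use a lookahead, and it has to be anchored at the start of the word: it scans everything from the current position onward for a doubled character, so without a boundary it could start checking in the middle of a word. Hence \b.)

At the time of writing, another answer has a false positive: it matches part of eoeuaoarounn.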
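
A quick way to check this pattern against that word, sketched in Python (the names `LOOKAHEAD` and `matches` are only for this example):

```python
import re

# The lookahead-based pattern from this answer, unchanged.
LOOKAHEAD = re.compile(r"(\b(?![0-9a-z]*([0-9a-z])\2)[0-9a-z]{10,12}\b)")

def matches(text):
    """Return the whole-word matches found in `text` (hypothetical helper)."""
    # finditer is used instead of findall because the pattern has two groups
    # and findall would return tuples.
    return [m.group(1) for m in LOOKAHEAD.finditer(text)]

print(matches("eoeuaoarounn"))  # [] because of the trailing "nn"
print(matches("eoeuaoaroun"))   # ['eoeuaoaroun']
```

The lookahead rejects the whole word as soon as it finds a doubled character anywhere in it, so no partial match is returned.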

Anton Kovalenko