
I would like to match all strings that start with 1 to 4 lowercase letters followed by 1 to 4 digits, where the overall length of that sequence (letters + digits) is exactly 5. The letters and digits must not intermingle. The actual string, however, is much longer, and this 5-character sequence is not followed by any distinct word boundary (it can be followed by [a-z0-9], for example). The regex in question should only be concerned with the first 5 characters.

For example:

  • Positive matches: a1111, aa111, abc12def, abc12345, ...
  • Negative matches: a1a1a, aa11a, aa11, 1aaaa x, ...

So I would need something like ^([a-z]{1,4})[0-9]{5 - length of \1}.

This question seems to be slightly related but I couldn't figure out how to make the length of the second group dependent on the first. This answer suggests to perform a lookahead on all the possible characters but doesn't prevent intermingling.

I don't want to perform a match on only the first five characters of the string (and then check the length of the actual match), since I would like to augment this regex in order to match the remainder of the string with some other pattern.

The group lengths are small for the sake of the example, but they are actually much longer, so manually specifying the various combinations is not an option, and auto-generating a regex that contains all the combinations makes me worry about performance.

Specifically, I am using Python 3.6, but I am happy with solutions for other regex flavors as well.

a_guest
  • I believe you have a typo: `aaa11b` should not be a positive match by your rules. Should it be `aaa11`? According to your rules `\b(?=[a-z\d]{5}\b)[a-z]{1,4}\d{1,4}\b` should work. – ctwheels Oct 09 '18 at 15:54
  • @ctwheels It's not a typo, this string should be a match. The regex should only be concerned with the first five characters of the string. These five characters must be a sequence of 1-4 letters followed by 1-4 digits (overall length 5). This 5-sequence can be followed by another letter or digit. You can think of it as an identifier that consists of `[a-z0-9]` and I want to match all ids that start with the above described pattern (e.g. `abc12def` should match). I have further criteria that should apply to the remaining characters hence my question if it can be done with a single regex/match. – a_guest Oct 09 '18 at 22:01
  • I’ve edited my answer accordingly – ctwheels Oct 09 '18 at 22:50

2 Answers

You can use the following pattern to avoid having to write out the alternations.

\b[a-z]{1,4}\d{1,4}(?<=\b[a-z\d]{5})
  • \b Assert position at a word boundary
  • [a-z]{1,4} Matches a lowercase letter between 1 and 4 times
  • \d{1,4} Matches a digit between 1 and 4 times
  • (?<=\b[a-z\d]{5}) Positive lookbehind ensuring that exactly 5 characters, each a lowercase letter or a digit, lie between the preceding word boundary and the current position
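As a quick check of the pattern above, here is a minimal Python sketch that runs it against the examples from the question. The names `PREFIX`, `leading_id`, and `FULL`, and the trailing `[a-z0-9]*\Z` part, are made up for illustration; only the prefix pattern comes from this answer.

```python
import re

# Pattern from the answer above. The lookbehind has a fixed width
# (5 characters), which Python's built-in re module requires.
PREFIX = re.compile(r"\b[a-z]{1,4}\d{1,4}(?<=\b[a-z\d]{5})")

def leading_id(s):
    """Return the 5-character prefix if s starts with it, else None."""
    m = PREFIX.match(s)  # match() only tries the start of s
    return m.group(0) if m else None

# Examples taken from the question.
for s in ["a1111", "aa111", "abc12def", "abc12345"]:
    print(s, "->", leading_id(s))   # each prints its first 5 characters

for s in ["a1a1a", "aa11a", "aa11", "1aaaa x"]:
    print(s, "->", leading_id(s))   # each prints None

# Hypothetical extension: the lookbehind is zero-width, so more pattern
# text can follow it to constrain the remainder of the string.
FULL = re.compile(PREFIX.pattern + r"[a-z0-9]*\Z")
print(bool(FULL.match("abc12def")))
```

In Python 3, `\d` in a `str` pattern also matches non-ASCII Unicode digits; replacing it with `[0-9]` restricts the match to ASCII digits.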
ctwheels
  • That's it, using the beginning of the string as a natural boundary together with lookbehind does the trick. Thanks. – a_guest Oct 10 '18 at 07:55
  • If you add word boundary to the lookbehind it would capture only the first 5 characters of things like `abc12345`: `\b[a-z]{1,4}[0-9]{1,4}(?<=\b\w{5})` – ayorgo Oct 10 '18 at 09:03
  • @ayorgo the word boundary is not needed in the lookbehind since it’s at the beginning of the pattern. The lookbehind won’t be reached unless the word boundary exists. Adding it in the lookbehind simply adds steps. – ctwheels Oct 10 '18 at 11:36
  • @ctwheels OK, since it's become clear that the OP doesn't care about matching _only_ the first 5 characters, I agree. – ayorgo Oct 10 '18 at 12:05
  • @ayorgo thank you, I’ve added your edit to prevent confusion for future viewers – ctwheels Oct 10 '18 at 12:32

Regular expressions cannot count, so you need to use alternation like this:

\b([a-z][0-9]{4}|[a-z]{2}[0-9]{3}|[a-z]{3}[0-9]{2}|[a-z]{4}[0-9])\b
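For longer sequences the branches could be generated rather than typed by hand. The following is only a sketch under assumptions: the helper `build_prefix_pattern` and its parameters are hypothetical, and it deliberately drops the trailing `\b` so that only the prefix is constrained, which differs from the pattern above. It makes no claim about performance for very large lengths.

```python
import re

def build_prefix_pattern(total, max_letters, max_digits):
    """Build one alternation branch per valid letter/digit split.

    Hypothetical helper: 'total' is the required combined length,
    'max_letters' and 'max_digits' cap each part.
    """
    branches = [
        f"[a-z]{{{k}}}[0-9]{{{total - k}}}"
        for k in range(1, max_letters + 1)
        if 1 <= total - k <= max_digits
    ]
    # No trailing \b here, so the prefix may be followed by [a-z0-9].
    return r"\b(?:" + "|".join(branches) + ")"

pattern = re.compile(build_prefix_pattern(5, 4, 4))

for s in ["a1111", "abc12def", "a1a1a", "aa11"]:
    m = pattern.match(s)
    print(s, "->", m.group(0) if m else None)
```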


wp78de
  • This doesn't match `aaa11b` or `abc12def` for example (see my updated question or the comments for further explanation). Also, as I mentioned in my question, the brevity of the sequences is for the scope of the example, the real sequences are much longer (1000+ characters, but can be even longer than that, no real upper boundary). Hence "manual" alternation is not possible and even for auto-generated alternations I'm worried about performance. – a_guest Oct 09 '18 at 22:07