
I'm trying to match a String input with the criteria below:

  1. The first characters are unique lowercase English letters
  2. The next characters represent a year from 1500 to 2020
  3. The next characters can only be 10, or 100, or 1000
  4. The last character will be a digit 0 through 9

The regex I have created, which I believe is mostly correct, is shown below with explanations:

String validRegex =
"^"+                                    // start of string
"(?=.*[a-z].*[a-z].*[a-z])"+            // Ensure string has only 3 consecutive lowercase English letters
"(?=.*[0-9].*[0-9].*[0-9].*[0-9])"+     // Ensure string has only 4 digits representing year i.e. 2020
"(?=.*([0-9].*[0-9]) | ([0-9].*[0-9].*[0-9]) | ([0-9].*[0-9].*[0-9].*[0-9]))"+ // Ensure 10, 100, or 1000 digits
"(?=.*[0-9])"+                          // Ensure last character is a digit 0-9
"(?=\\S+$)"+                            // Ensure string has no whitespace
".{10,12}"+                             // Entire string length must be from 10 through 12 characters
"$";                                    // end of string

Is there a simple way to update my regex so that it only accepts unique consecutive characters?

ennth
  • Yes, use `(?=([a-z])(?!\\1)([a-z])(?!\\1|\\2)[a-z])` as the first lookahead after `^` – Wiktor Stribiżew Oct 28 '20 at 11:47
  • How would I make sure the YEAR, which is 4 consecutive digits [0-9], has a value from 1500 through 2020? Would I have to parse out the GROUPS? – ennth Oct 28 '20 at 11:49
  • `(1[5-9][0-9]{2}|20[01][0-9]|2020)`? Use http://gamon.webfactional.com/regexnumericrangegenerator/ for that. – Wiktor Stribiżew Oct 28 '20 at 11:52
  • Your other requirements seem off, too. `(?=.*[a-z].*[a-z].*[a-z])` does not guarantee there are only 3 letters. – Wiktor Stribiżew Oct 28 '20 at 12:00
  • @ennth, do the _consecutive characters_ in the prefix mean that `abc|bcd|...|xyz` are valid only, while `abd`, `zab` are invalid? – Nowhere Man Oct 28 '20 at 12:30
  • At some point you'll have to ask yourself if regex are the right tool for the job. If your rules become so complex that no sane person will be able to understand them based on the regex, then maybe just writing plain old code iterating over characters becomes a viable alternative. – Joachim Sauer Oct 28 '20 at 13:13
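
As a rough illustration of the plain-code alternative mentioned in the last comment, here is a minimal sketch that checks the same rules by iterating over the characters. The class and method names (`CodeValidator`, `isValid`) are hypothetical, and the sketch assumes "unique" means the three letters are pairwise different and that the layout is exactly 3 letters, a 4-digit year, then 10, 100, or 1000, then one digit.

```java
public class CodeValidator {

    // Assumed layout: 3 distinct lowercase letters, a year in 1500..2020,
    // one of "10"/"100"/"1000", and a single closing digit.
    static boolean isValid(String s) {
        if (s == null || s.length() < 10 || s.length() > 12) {
            return false;
        }
        char a = s.charAt(0), b = s.charAt(1), c = s.charAt(2);
        if (!isLower(a) || !isLower(b) || !isLower(c)) {
            return false;
        }
        // "Unique" is interpreted here as pairwise different letters.
        if (a == b || a == c || b == c) {
            return false;
        }
        // Everything after the prefix must be an ASCII digit.
        for (int i = 3; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch < '0' || ch > '9') {
                return false;
            }
        }
        int year = Integer.parseInt(s.substring(3, 7));
        if (year < 1500 || year > 2020) {
            return false;
        }
        // The part between the year and the closing digit.
        String amount = s.substring(7, s.length() - 1);
        return amount.equals("10") || amount.equals("100") || amount.equals("1000");
    }

    static boolean isLower(char ch) {
        return ch >= 'a' && ch <= 'z';
    }

    public static void main(String[] args) {
        System.out.println(isValid("abc2020105"));   // true
        System.out.println(isValid("aab2020105"));   // false: repeated letter
        System.out.println(isValid("abc2021105"));   // false: year out of range
    }
}
```

Whether this is easier to maintain than a regex is a judgment call; it mainly makes the year range and the allowed values explicit.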

2 Answers


Look:

  • The entire input (String) length will be from 10 to 12 characters always - ^.{10,12}$ (HOWEVER, in this case, you do not need to add this to the overall pattern because all parts below will sum up to 10, 11 or 12 chars allowed in the string)
  • The first 3 characters are UNIQUE lowercase English letters ([a-z]) - ^([a-z])(?!\\1)([a-z])(?!\\1|\\2)[a-z]
  • The next 4 characters represent a year from 1500 to 2020, e.g. 2020 - (?:1[5-9][0-9]{2}|20[01][0-9]|2020)
  • The next characters can only be 10, 100, or 1000 (so 2 chars at minimum, i.e. 10, and 4 chars at maximum, i.e. 1000) - [0-9]{2,4}. Note that [0-9]{2,4} accepts any 2 to 4 digits; to accept only the literal values 10, 100, and 1000, use 10{1,3} instead
  • The last character will be a digit 0 through 9 - [0-9].

Joining these bits, you get

String regex = "^([a-z])(?!\\1)([a-z])(?!\\1|\\2)[a-z](?:1[5-9][0-9]{2}|20[01][0-9]|2020)[0-9]{2,4}[0-9]$";

See the regex demo.
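
To see how the joined pattern behaves, here is a small sketch that runs it against a few sample inputs. The class name `RegexCheck`, the helper `matches`, and the sample strings are made up for illustration; the pattern itself is the one above.

```java
import java.util.regex.Pattern;

public class RegexCheck {

    // The joined pattern from the answer, compiled once.
    static final Pattern PATTERN = Pattern.compile(
        "^([a-z])(?!\\1)([a-z])(?!\\1|\\2)[a-z]"
        + "(?:1[5-9][0-9]{2}|20[01][0-9]|2020)"
        + "[0-9]{2,4}[0-9]$");

    static boolean matches(String s) {
        return PATTERN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("abc2020105"));  // true
        System.out.println(matches("aab2020105"));  // false: second letter repeats the first
        System.out.println(matches("aba2020105"));  // false: third letter repeats the first
        System.out.println(matches("abc1499105"));  // false: year below 1500
        System.out.println(matches("abc2021105"));  // false: year above 2020
        System.out.println(matches("abc2020995"));  // true: [0-9]{2,4} accepts any digits, not only 10/100/1000
    }
}
```

The last sample shows that the `[0-9]{2,4}` part is looser than the stated requirement; replacing it with `10{1,3}` would reject that input.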

If you plan to support both lower- and uppercase letters, add the case insensitive modifier (?i) at the start:

String regex = "(?i)^([a-z])(?!\\1)([a-z])(?!\\1|\\2)[a-z](?:1[5-9][0-9]{2}|20[01][0-9]|2020)[0-9]{2,4}[0-9]$";

If there can be a letter at the end, not just a digit, you may use

String regex = "(?i)^([a-z])(?!\\1)([a-z])(?!\\1|\\2)[a-z](?:1[5-9][0-9]{2}|20[01][0-9]|2020)[0-9]{2,4}[0-9a-z]$";

See this regex demo.

To create regex number ranges, you may use services such as gamon.webfactional.com, richie-bendall.ml, or MyRegexTester.com.

See the Java demo:

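import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Groups 2 and 3 are the inner single-letter captures used by the backreferences,
// so the four parts of interest are groups 1, 4, 5, and 6.
String regex = "(?i)(([a-z])(?!\\2)([a-z])(?!\\2|\\3)[a-z])(1[5-9][0-9]{2}|20[01][0-9]|2020)([0-9]{2,4})([0-9a-z])";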
String s = "AVG190420T";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(s);
if (matcher.find()){
    System.out.println("Part 1: " + matcher.group(1));
    System.out.println("Part 2: " + matcher.group(4));
    System.out.println("Part 3: " + matcher.group(5));
    System.out.println("Part 4: " + matcher.group(6));
} else {
    System.out.println(s + " does not match the pattern.");
}

Output:

Part 1: AVG
Part 2: 1904
Part 3: 20
Part 4: T
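
If the captured parts are needed for calculations, named groups can make the extraction less error-prone than counting group numbers. The following is only a sketch: the group names (`letters`, `year`, `amount`, `last`), the class name `NamedGroups`, and the helper `parts` are hypothetical, and the pattern is the strict lowercase, digit-terminated variant rather than the case-insensitive one used in the demo.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroups {

    // Named groups are still numbered, so "letters" is group 1 and the
    // inner single-letter captures remain groups 2 and 3 for the backreferences.
    static final Pattern PATTERN = Pattern.compile(
        "^(?<letters>([a-z])(?!\\2)([a-z])(?!\\2|\\3)[a-z])"
        + "(?<year>1[5-9][0-9]{2}|20[01][0-9]|2020)"
        + "(?<amount>[0-9]{2,4})"
        + "(?<last>[0-9])$");

    // Returns {letters, year, amount, last}, or null if the input does not match.
    static String[] parts(String s) {
        Matcher m = PATTERN.matcher(s);
        if (!m.matches()) {
            return null;
        }
        return new String[] {
            m.group("letters"), m.group("year"), m.group("amount"), m.group("last")
        };
    }

    public static void main(String[] args) {
        String[] p = parts("abc2020105");
        if (p != null) {
            int year = Integer.parseInt(p[1]);
            int amount = Integer.parseInt(p[2]);
            // Example calculation on the extracted values.
            System.out.println(p[0] + ": " + (year + amount));
        }
    }
}
```

Since the amount group is greedy but must leave one digit for the final group, `abc2020105` splits into `abc`, `2020`, `10`, and `5`.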
Wiktor Stribiżew
  • If I wanted the year range from 1900 to 2019, it's just (?:1[9][0-9]{2}|20[01][0-9]|2019), yes? – ennth Oct 28 '20 at 12:26
  • @ennth Use that gamon site, `(?:19[0-9][0-9]|20[10][0-9])` – Wiktor Stribiżew Oct 28 '20 at 12:28
  • And isn't the ^.{10,12}$ missing from your String regex variable at the beginning? And why doesn't the [0-9]{2,4} have ( ) parentheses for a group? What if I want to extract this group/value and do a calculation? – ennth Oct 28 '20 at 12:43
  • @ennth `^.{10,12}$` is not necessary since the pattern itself already matches 10 to 12 chars. `AVG190420T` returns false because the first letters are uppercase and the last char is a letter, not a digit (you wrote "*The last character will be a digit 0 through 9*"). If you need a case insensitive regex, compile it with the `Pattern.CASE_INSENSITIVE` flag, or add `(?i)` at the start. If you need to extract any part of the match, wrap the pattern part matching that bit with a pair of capturing parentheses. – Wiktor Stribiżew Oct 28 '20 at 12:51
  • @ennth If the string can end with a letter, replace `[0-9]$` with `[0-9A-Za-z]$` or simply `\\p{Alnum}$`. – Wiktor Stribiżew Oct 28 '20 at 12:55
  • Thank you sir. Is there a way to GET the value of the GROUP [0-9]{2,4}? I'm trying to use the matcher object to extract the GROUPS but getting weird output. I would like all the groups to be parseable, so I can extract their values and manipulate them. – ennth Oct 28 '20 at 13:03
  • @ennth I did not capture it. And your year pattern matches `2019`, see https://regex101.com/r/gwkBic/3 – Wiktor Stribiżew Oct 28 '20 at 13:30
  • Thanks for your help. Marked as best answer. – ennth Oct 29 '20 at 12:10

The following regex does not use lookaheads, but it seems to validate the initial requirements more closely:

^(abc|bcd|cde|def|efg|fgh|ghi|hij|ijk|jkl|klm|lmn|mno|nop|opq|pqr|qrs|rst|stu|tuv|uvw|vwx|wxy|xyz)(1[5-9]\d{2}|20[0-1]\d|2020)10{1,3}\d$

Online demo

The 1st group (abc|bcd|...|xyz) validates unique consecutive lowercase letters.

The 2nd group validates year: (1[5-9]\d{2}|20[01]\d|2020) match year from 1500 to 2020

The remaining digital suffix is validated:

  • 10{1,3} matches 10, 100 or 1000
  • \d match the closing digit
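
The long alternation for the prefix does not have to be typed by hand. As a sketch, it can be generated in Java and combined with the rest of the pattern; the class name `ConsecutivePrefix` and the helpers `buildPrefixAlternation` and `matches` are hypothetical names used only for this example.

```java
import java.util.StringJoiner;
import java.util.regex.Pattern;

public class ConsecutivePrefix {

    // Builds "(abc|bcd|...|xyz)": every run of three alphabetically consecutive letters.
    static String buildPrefixAlternation() {
        StringJoiner joiner = new StringJoiner("|", "(", ")");
        for (char c = 'a'; c <= 'x'; c++) {
            joiner.add("" + c + (char) (c + 1) + (char) (c + 2));
        }
        return joiner.toString();
    }

    // Same structure as the pattern above: prefix, year 1500..2020, 10/100/1000, closing digit.
    static final Pattern PATTERN = Pattern.compile(
        "^" + buildPrefixAlternation() + "(1[5-9]\\d{2}|20[01]\\d|2020)10{1,3}\\d$");

    static boolean matches(String s) {
        return PATTERN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(buildPrefixAlternation());
        System.out.println(matches("abc2020105"));    // true
        System.out.println(matches("abd2020105"));    // false: letters are not consecutive
        System.out.println(matches("abc2020205"));    // false: 20 is not 10, 100 or 1000
    }
}
```

This assumes that "consecutive" means adjacent in the alphabet, which is the interpretation this answer takes.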

Update
For the year range 1900..2019, the pattern is (19\d{2}|20[01]\d).
For digits like 10, 20, 50, 100, 200, 500, 1000, the pattern is (10{1,3}|[25]0{1,2}).

Updated online demo

Nowhere Man