Regex to match standard and outlier input

Question

I've been using a regex for some time with jQuery validation to ensure my users input valid strings for drawing names. In the past week we've added the ability to use 3rd-party devices whose strings are somewhat different. I'm tasked with allowing those strings as well as the previous set as valid input. I send them to this validator adapted from this SO answer:

$.validator.addMethod("accept", function (value, element, param)
{
    return value.match(new RegExp("^" + param + "$"));
});

Notice I'm tacking on the ^ & $ characters.

Like this:

        drawingName: {
            required: true,
            accept: "[0-9]{4,5}[\\.?\\w*]{0,4}"
        },

The double backslashes escape the single backslashes for use in the validator. If you're testing in something like http://www.rubular.com/, you'll want to use single backslashes.
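A minimal sketch of the escaping difference, assuming the pattern is passed to `new RegExp` as a string the way the validator above does. The variable names are only illustrative.

```javascript
// In a JavaScript string literal, "\\d" yields the two characters \d,
// while "\d" is an unrecognized escape and collapses to a plain "d".
var doubled = new RegExp("^\\d{4}$");
var single = new RegExp("^\d{4}$");

console.log(doubled.source);        // ^\d{4}$
console.log(single.source);         // ^d{4}$

console.log(doubled.test("1234"));  // true
console.log(single.test("1234"));   // false
console.log(single.test("dddd"));   // true
```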

The previous set (for which I made generic representations, where "X" represents letters, "0" represents digits, and decimal points are what they are) consisted of these valid possibilities:

There are tens of thousands of these variants in my data, and changing them is not a possibility. The company is working on standardized nomenclature, but the devices we've sold to customers can come back for repair, maintenance and calibration for decades: so we can't ever get rid of the old system.

The new string variant looks like this:

XXX-0000
XXX-00000
XXX-000000

I've modified the original regex to accommodate the change: [\\w{3}-]*\\d{4,5}[\\.?\\w*]{0,4}

I've also tried \p{L}{3}?-?\d{4,6}\.?\w{0,2}\.?\w{0,2}, but that exhibits the same problem (see immediately below).

Both work, but in my testing I've noticed that they will allow a seemingly infinite number of extra characters to be tacked onto the end of the valid possibilities. (I'm pretty sure the old regex allowed the same type of bad input.)

So, to trap the new string, I'd need to look for three letters followed by a dash, followed by four to six digits (something like this: [\w{3}-]?\d{4,6}? or [\p{L}{3}-]?\d{4,6}?). I'd also need to accommodate the previous drawing names: four to five digits, possibly followed by a single letter, digit or decimal point, possibly followed by a single letter or digit, possibly followed by a single digit or letter (confusing, huh?). Something like this: \d{4,5}[\.\w*]{0,4}. I think the problem with this part lies in the asterisk following the w, but I'm not sure how to fix it or how to properly concatenate the two parts of the regex.

What I'm looking for is hopefully a single regex that'll allow me to screen for valid input with all of the string variants above, but block on invalid input. I know I can simply add another validation rule, and that might be what I have to do, but I wanted to see if it's possible to do it in a single regex.

Edit:

Here is the final solution as suggested by Lucas that I'm using in my code, modified somewhat to not use the \w as pointed out in his answer below:

(?:\\d{4,5}[0-9a-zA-Z]{0,2}(?:\\.[0-9a-zA-Z]{1,2})?|[a-zA-Z]{3}-\\d{6})
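The outer non-capturing group matters here because the validator adds ^ and $ around whatever pattern it receives, and alternation binds more loosely than the anchors. A small sketch of the difference follows; the `accepts` helper is hypothetical and only mirrors the "accept" method from the question.

```javascript
// Hypothetical helper mirroring the "accept" validator method:
// it wraps the rule's pattern in ^ and $ before testing.
function accepts(value, param) {
    return new RegExp("^" + param + "$").test(value);
}

// The pattern from the edit above, wrapped in a non-capturing group.
var grouped = "(?:\\d{4,5}[0-9a-zA-Z]{0,2}(?:\\.[0-9a-zA-Z]{1,2})?|[a-zA-Z]{3}-\\d{6})";

// The same alternatives without the outer group (for comparison only).
var ungrouped = "\\d{4,5}[0-9a-zA-Z]{0,2}(?:\\.[0-9a-zA-Z]{1,2})?|[a-zA-Z]{3}-\\d{6}";

console.log(accepts("12345A.6", grouped));           // true
console.log(accepts("12345A.6 junk", grouped));      // false
console.log(accepts("12345A.6 junk", ungrouped));    // true: only the first branch is tied to ^
console.log(accepts("junk ABC-123456", ungrouped));  // true: only the second branch is tied to $
```

Without the group, each branch is anchored on one side only, which is one way trailing or leading junk can slip through.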

score 2 · Accepted Answer · answered Jan 14 '15 at 20:14

You can do something like this:

^(?:\d{4,5}\w{0,2}(?:\.\w{1,2})?|\w{3}-\d{6})$

Demo

I just used the alternation operator (|) to split between the old and new formats.

Note that your original regex ([0-9]{4,5}[\.?\w*]{0,4}) probably has an issue: [\.?\w*] means . or ? or a word character or *, and it doesn't seem that's what you're after. I made it stricter per your examples, but you may have to adjust it.
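To make the character-class point concrete, here is a short sketch that tests the bracket expression one character at a time. Inside square brackets, ?, * and . are literals, not quantifiers or wildcards. The variable names are only illustrative.

```javascript
// The bracket expression from the original pattern, tested on single characters.
var charClass = /^[\.?\w*]$/;

console.log(charClass.test("."));  // true
console.log(charClass.test("?"));  // true
console.log(charClass.test("*"));  // true
console.log(charClass.test("A"));  // true
console.log(charClass.test("-"));  // false

// So the original tail accepts up to four of those characters in any order.
var originalPattern = /^[0-9]{4,5}[\.?\w*]{0,4}$/;

console.log(originalPattern.test("12345?*.."));  // true
console.log(originalPattern.test("12345...."));  // true
```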

Also, note that \w means [0-9a-zA-Z_] in JS - this may not be exactly what you're after (particularly the underscore).
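A hedged sketch of what swapping \w for an explicit class changes. The `explicit` variant is an illustration, not part of the answer: it assumes letters and digits only, and it assumes {4,6} for the new format so that the question's XXX-0000 and XXX-00000 samples also match.

```javascript
// \w also accepts an underscore, which is easy to overlook.
console.log(/^\w$/.test("_"));           // true
console.log(/^[0-9a-zA-Z]$/.test("_"));  // false

// The pattern above as written.
var withWordClass = /^(?:\d{4,5}\w{0,2}(?:\.\w{1,2})?|\w{3}-\d{6})$/;

// Assumption: explicit classes, and {4,6} digits after the dash.
var explicit = /^(?:\d{4,5}[0-9a-zA-Z]{0,2}(?:\.[0-9a-zA-Z]{1,2})?|[a-zA-Z]{3}-\d{4,6})$/;

console.log(withWordClass.test("12345_"));    // true
console.log(explicit.test("12345_"));         // false
console.log(withWordClass.test("ABC-1234"));  // false
console.log(explicit.test("ABC-1234"));       // true
```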

answered Jan 14 '15 at 20:14

Lucas Trzesniewski

Thanks! I didn't consider being able to split the regex into two independent pieces. You're also correct, `\w` isn't really what I wanted, so I think I'll replace it with `[0-9a-zA-Z]`. Also thanks for the demo, I didn't know that site existed. I'll let you know what I work out. – delliottg Jan 14 '15 at 21:19

J0e3gan · Answer 2 · 2015-01-15T06:33:29.787

Single Regex

Here is a straightforward (relative to the scale of your problem) single regex that accounts for all the sample inputs provided – but does not "allow a seemingly infinite number of extra characters to be tacked onto the end of the valid possibilities":

^(\d{4}(\d([\dA-Z]|(\.(\d{1,2}|[A-Z]))|\d[A-Z]|[A-Z](\.\d[A-Z]?|[\dA-Z]\.\d))?|[A-Z]\.\d)|[A-Z]{3}-\d{4,6})$

Regular expression visualization

Debuggex Demo

Below is a snippet to test the regex with the sample inputs provided:

var regex = new RegExp("^(\\d{4}(\\d([\\dA-Z]|(\\.(\\d{1,2}|[A-Z]))|\\d[A-Z]|[A-Z](\\.\\d[A-Z]?|[\\dA-Z]\\.\\d))?|[A-Z]\\.\\d)|[A-Z]{3}-\\d{4,6})$");
// tests for the sample inputs provided that are all expected to match
var tests = [
    '12345',
    '12345.6',
    '12345.67',
    '12345.A',
    '123456',
    '123456A',
    '12345A',
    '12345A.6',
    '12345A.6B',
    '12345A6.7',
    '12345AB.6',
    '1234A.5',
    'ABC-1234',
    'ABC-12345',
    'ABC-123456'
];

for (var i = 0; i < tests.length; i++) {
    var result = "'" + tests[i] + "' => ";
                
    if (regex.test(tests[i])) {
        result += 'Yup!';
    } else {
        result += 'Nope...';
    }

    console.log(result);
}
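The snippet above only exercises inputs that should match. A companion sketch for the rejecting side follows; the inputs in it are hypothetical and merely assumed to be invalid, since the question does not list any invalid samples.

```javascript
// Same pattern as in the snippet above.
var regex = new RegExp("^(\\d{4}(\\d([\\dA-Z]|(\\.(\\d{1,2}|[A-Z]))|\\d[A-Z]|[A-Z](\\.\\d[A-Z]?|[\\dA-Z]\\.\\d))?|[A-Z]\\.\\d)|[A-Z]{3}-\\d{4,6})$");

// Hypothetical inputs assumed to be invalid; they are not from the question.
var invalidTests = [
    '123',           // fewer than four digits
    '12345.',        // trailing period with nothing after it
    '12345.678',     // three digits after the period
    '12345ABCDEFG',  // extra characters tacked onto the end
    'AB-1234',       // only two letters before the dash
    'ABC-123',       // only three digits after the dash
    'ABC-1234567'    // seven digits after the dash
];

// Collect anything that matches even though it should not.
var unexpectedMatches = [];

for (var i = 0; i < invalidTests.length; i++) {
    var matched = regex.test(invalidTests[i]);
    if (matched) {
        unexpectedMatches.push(invalidTests[i]);
    }
    console.log("'" + invalidTests[i] + "' => " + (matched ? 'Matched (unexpected)' : 'Rejected'));
}
```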

Rationale

Clearly a top-level alternation (|) is the key to the old-variants/new-variant split; but I think that starting with \d{4,5} (particularly the {4,5} quantifier) for the old variants is the root of overmatching woes.

Instead start with what is common to all the old variants – four opening digits (i.e. \d{4}): this allows you to then straightforwardly use follow-on, inner alternations to match the old variants' divergences beyond their first four characters.

Details

This plays out as follows:

^ – start of the input string
( – start of the 1st-level alternation for old variants and the new variant
\d{4} – the four digits with which all old variants start
- ( – start of the 2nd-level alternation for old variants' primary divergences
- \d – a fifth digit
  - ( – start of 3rd-level alternations for optional divergences where the first five characters are digits
  - [\dA-Z] – a digit or letter
  - | – OR (3rd-level)
  - (\.(\d{1,2}|[A-Z])) – a period and 1) one or two digits OR 2) a letter
  - | – OR (3rd-level)
  - \d[A-Z] – a digit and a letter
  - | – OR (3rd-level)
  - [A-Z](\.\d[A-Z]?|[\dA-Z]\.\d) – a letter and 1) a period, a digit, and optionally a letter OR 2) a digit or letter, a period, and a digit
  - )? – end of 3rd-level alternations for optional divergences where the first five characters are digits
- | – OR (2nd-level)
- [A-Z]\.\d – a letter, a period, and a digit
- ) – end of the 2nd-level alternation for old variants' primary divergences
| – OR (1st-level)
[A-Z]{3}-\d{4,6} – the new variant
) – end of the 1st-level alternation for old variants and the new variant
$ – end of the input string

The Debuggex graphic above explains all this diagrammatically, but (some of) it may be easier to follow spelled out.

Maintainability

Using one regex to handle so many variants is naturally going to make maintainability a challenge.

To support maintainability:

Assign subpatterns like I have shown in the Details breakdown above to distinct string variables.
Then build up the single pattern by concatenating the subpatterns' strings.

This has the benefit of allowing you to comment each subpattern's string variable (like I have commented each bullet in the Details breakdown).
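As a rough sketch of that approach, the single regex can be assembled from commented string variables. The variable names are hypothetical; each piece is copied from the Details breakdown.

```javascript
// The four digits with which all old variants start.
var fourDigits = "\\d{4}";

// Optional divergences where the first five characters are digits.
var afterFifthDigit = "(" + [
    "[\\dA-Z]",                            // a digit or letter
    "(\\.(\\d{1,2}|[A-Z]))",               // a period and one or two digits, or a letter
    "\\d[A-Z]",                            // a digit and a letter
    "[A-Z](\\.\\d[A-Z]?|[\\dA-Z]\\.\\d)"   // a letter and either suffix form
].join("|") + ")?";

// Primary divergences after the first four digits.
var oldTail = "(\\d" + afterFifthDigit + "|[A-Z]\\.\\d)";

var oldVariants = fourDigits + oldTail;
var newVariant = "[A-Z]{3}-\\d{4,6}";      // the new variant

// Top-level alternation between old and new variants, anchored at both ends.
var drawingName = new RegExp("^(" + oldVariants + "|" + newVariant + ")$");

console.log(drawingName.test("12345A6.7"));   // true
console.log(drawingName.test("ABC-123456"));  // true
console.log(drawingName.test("12345A6.7X"));  // false
```

The assembled pattern is character-for-character the same as the single regex at the top of this answer, so the earlier test snippet should behave identically with it.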

Thanks for the help, for the time being I'm going to try to go with Lucas' solution as it seems more easily maintained, but I'm going to try to dissect yours as well so I understand what's going on. Thanks for the demo site as well, I've used them a time or two before, but it was a long time ago and I'd forgotten about it. — delliottg, Jan 14 '15 at 21:21
Just read through your edited answer, thanks for pointing out the problem, I'll try to apply that to the solution above. Thanks for taking the time to explain it to me. — delliottg, Jan 14 '15 at 22:25
Wow, **great** expansion on the original answer. Thanks for the extra effort and taking the time to do so, I really appreciate it. — delliottg, Jan 14 '15 at 23:32
