0

I'm trying to write a regex for the following rules:

  1. Characters 1-3 must be numeric
  2. Character 4 must be ‘P’
  3. Character 5 must be alpha
  4. Characters 6-12 must be numeric
  5. Character 13 must be numeric or ‘X’

These make up an account's office reference for accountancy purposes. So far I have the following:

^\d{3}P[A-Z]{1}\d{7}$

To finish the regex, I just need to say "any single number OR letter X", but I am not quite sure how to go about it. I tried \d{1}[X], but it's expecting a digit AND a letter.

Any ideas?

J0e3gan
envio
  • Why are you using RegEx when you could do the same thing with built-in methods, for example the `Contains()` method, substring methods, etc.? It sounds like you're taking the harder route, depending on how well you understand RegEx. – MethodMan Dec 19 '14 at 19:44
  • @DJKRAZE: Harder? For someone who knows regexps, this is totally straightforward. TMTOWDI of course, but I think "harder" hardly applies here. And working through more straightforward regexps like this rather than falling back to procedural conditional mechanisms...is precisely how to develop more regexp savvy. – J0e3gan Dec 19 '14 at 20:01
  • @J0e3gan For beginners, regex is a complex subject. Also, even though this is a straightforward regex, it could cause performance issues in the future or require revisiting the software when it is updated. http://stackoverflow.com/questions/2962670/regex-ismatch-vs-string-contains – Hozikimaru Dec 19 '14 at 20:04
  • You know what `\d` is, you know what `[A-Z]` is but you don't know what `[\dX]` is ??? –  Dec 19 '14 at 20:32
  • Correct sln, hence why I asked the question. Remember, there are no stupid questions. – envio Dec 22 '14 at 18:11

2 Answers

4

Try this:

^\d{3}P[A-Z]\d{7}[0-9X]$

The character class [0-9X] matches exactly one character, either a digit or X. It matches more than one only if a quantifier such as {2} follows it, so no {1} is needed.

Addendum:

As @sln pointed out, it would be best to settle on 0-9 or \d (not mix the two) in a given regexp for consistency – in other words use...

^\d{3}P[A-Z]\d{7}[\dX]$

...or...

^[0-9]{3}P[A-Z][0-9]{7}[0-9X]$

...in this case.
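For illustration, here is a minimal sketch of how the finished pattern might be wrapped in C#. The class and method names are hypothetical, and the pattern is a slightly stricter variant of the one above rather than a required change: it uses [0-9] because \d in .NET also accepts non-ASCII Unicode digits, and \z because $ also matches just before a single trailing newline.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the class and method names are illustrative only.
public static class OfficeReferenceRegex
{
    // Stricter variant of the pattern above (an assumption about what is wanted):
    //   [0-9] limits matching to ASCII digits,
    //   \z    anchors at the true end of the input.
    private static readonly Regex Pattern =
        new Regex(@"^[0-9]{3}P[A-Z][0-9]{7}[0-9X]\z", RegexOptions.Compiled);

    // Returns true only when the whole string satisfies the five rules.
    public static bool IsValid(string value)
    {
        return value != null && Pattern.IsMatch(value);
    }
}
```

With this sketch, OfficeReferenceRegex.IsValid("111PH1234567X") returns true, while "345BA7654321Z" is rejected because its fourth character is not P.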

Performance

Regarding the comments about abysmal regexp performance: the concerns are greatly overstated.

Here is a quick sanity check...

void Main()
{
    // Quick sanity check.

    string str = "111PH1234567X";

    Stopwatch stopwatch = Stopwatch.StartNew();

    for (int i = 0; i < 1000000; i++)
    {
        if (str.Substring(0, 3).All(char.IsDigit)              //first 3 are digits
               && str[3] == 'P'                                //4th is P
               && char.IsLetter(str[4])                        //5th is a letter
               && str.Substring(5, 7).All(char.IsDigit)        //6-12 are digits
               && (char.IsDigit(str[12]) || str[12] == 'X'))   //13 is a digit or X (parenthesized: && binds tighter than ||)
       {
           ;
           //Console.WriteLine("good");
       }
    }

    Console.WriteLine(stopwatch.Elapsed);

    stopwatch = Stopwatch.StartNew();

    Regex regex = new Regex(@"^\d{3}P[A-Z]\d{7}[0-9X]$", RegexOptions.Compiled);
    for (int j = 0; j < 1000000; j++)
    {
        regex.IsMatch(str);
    }

    Console.WriteLine(stopwatch.Elapsed + " (regexp)");

    // A bit more rigorous sanity check.

    string[] strs = { "111PH1234567X", "grokfoobarbaz", "really, really, really, really long string that does not match", "345BA7654321Z" };

    Stopwatch stopwatch2 = Stopwatch.StartNew();

    for (int i = 0; i < strs.Length; i++)
    {
        for (int j = 0; j < 1000000; j++)
        {
            if (strs[i].Substring(0, 3).All(char.IsDigit)              //first 3 are digits
                && strs[i][3] == 'P'                                   //4th is P
                && char.IsLetter(strs[i][4])                           //5th is a letter
                && strs[i].Substring(5, 7).All(char.IsDigit)           //6-12 are digits
                && (char.IsDigit(strs[i][12]) || strs[i][12] == 'X'))  //13 is a digit or X (parenthesized: && binds tighter than ||)
            {
                ;
                //Console.WriteLine("good");
            }
        }
    }

    Console.WriteLine(stopwatch2.Elapsed);

    stopwatch2 = Stopwatch.StartNew();

    Regex regex2 = new Regex(@"^\d{3}P[A-Z]\d{7}[0-9X]$", RegexOptions.Compiled);
    for (int i = 0; i < strs.Length; i++)
    {
        for (int j = 0; j < 1000000; j++)
        {
            regex2.IsMatch(strs[i]);
        }
    }

    Console.WriteLine(stopwatch2.Elapsed + " (regexp)");
}

...that yields the following on my humble machine:

00:00:00.2134404
00:00:00.4527271 (regexp)
00:00:00.4872452
00:00:00.9534147 (regexp)

The regexp approach appears to be ~2x slower. As with anything, one needs to consider what makes sense for the use case, scale, etc. Personally, I side with Donald Knuth: I start from "premature optimization is the root of all evil" and would make a performance-driven choice only as needed.

J0e3gan
  • You should remove the `{1}`, which is ALWAYS extraneous. Also, you should either specify case-insensitive, or change to `[A-Za-z]`. – Brian Stephens Dec 19 '14 at 19:49
  • @BrianStephens: Good catch - exactly why I did not use one for the character group `[0-9X]`. I will update accordingly, as it is a little bonus improvement to go along with the OP's primary need. Also, it reminds me that explicitness of this sort is more common for regex newcomers - always good to provide reminders that this is a habit to drop as one develops more regex savvy. – J0e3gan Dec 19 '14 at 19:52
  • You shouldn't mix metaphors. If you use `\d`, always use it, don't use `[0-9X]`, use `[\dX]`. –  Dec 19 '14 at 20:36
  • @sln: I don't think it is a crucial point (or that we are dealing in metaphors), but, yes, it makes sense to go with either `[0-9]` or `\d` throughout a given regexp for consistency. – J0e3gan Dec 19 '14 at 20:57
  • Be careful of operator precedence. The `if(... && char.IsDigit(str[12]) || str[12] == 'X')` will be interpreted as `if((... && char.IsDigit(str[12])) || str[12] == 'X')` so you need to add braces, eg: `if(... && (char.IsDigit(str[12]) || str[12] == 'X'))`. – AdrianHHH Dec 26 '14 at 20:01
2

I prefer basic methods to Regex when I can.

This is a whitelist approach:

var str = "111PH1234567X";

if (str.Substring(0, 3).All(char.IsDigit)              //first 3 are digits
       && str[3] == 'P'                                //4th is P
       && char.IsLetter(str[4])                        //5th is a letter
       && str.Substring(5, 7).All(char.IsDigit)        //6-12 are digits
       && (char.IsDigit(str[12]) || str[12] == 'X'))   //13 is a digit or X (parenthesized: && binds tighter than ||)
   {
       Console.WriteLine("good");
   }

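You may need to add a check for the string length, depending on your requirements: as written, an input shorter than 13 characters can throw an exception (from `Substring` or the indexer), and an input longer than 13 characters can still pass.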
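The length check mentioned above could be sketched as follows. This is only one way to do it, assuming the reference must be exactly 13 characters long; the class and method names are hypothetical.

```csharp
using System.Linq;

// Hypothetical helper; the class and method names are illustrative only.
public static class OfficeReferenceCheck
{
    public static bool IsValid(string str)
    {
        // Checking null and length first means Substring and the indexer
        // below can never be called with an out-of-range position.
        return str != null
            && str.Length == 13
            && str.Substring(0, 3).All(char.IsDigit)          // first 3 are digits
            && str[3] == 'P'                                  // 4th is P
            && char.IsLetter(str[4])                          // 5th is a letter
            && str.Substring(5, 7).All(char.IsDigit)          // 6-12 are digits
            && (char.IsDigit(str[12]) || str[12] == 'X');     // 13th is a digit or X
    }
}
```

Because && short-circuits, a string that is too short or too long is rejected before any character is inspected, so OfficeReferenceCheck.IsValid("111PH123") returns false instead of throwing.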

Running this 1 million times vs the regex approach shows it is, at worst (str is valid, every condition is checked), 4x faster. Just throwing that out there.

Jonesopolis
  • This has its place too, but the required regex couldn't be more straightforward: I would only take this route where regex comfort is nil, but TMTOWDI of course. See my contrasting opinion for a [trickier case in JavaScript](http://stackoverflow.com/a/27560973/1810429) - i.e. a case where I would wholeheartedly agree. – J0e3gan Dec 19 '14 at 19:57
  • I would check your performance metrics on that. Yes, regexps are typically slower for the tradeoff of more concise code, less to debug and maintain etcetera; but let's be accurate if performance is the question: my quick analysis shows "at worst...4x faster" to be quite an exaggeration (per the edit to my answer). Like I said, TMTOWDI, but my way is to not make performance arguments before it is clear they matter. – J0e3gan Dec 19 '14 at 20:48
  • Having said this, I +1'd this answer immediately, putting my money where my TMTOWDI is. :) – J0e3gan Dec 19 '14 at 21:05
  • Be careful of operator precedence. The `if(... && char.IsDigit(str[12]) || str[12] == 'X')` will be interpreted as `if((... && char.IsDigit(str[12])) || str[12] == 'X')` so you need to add braces, eg: `if(... && (char.IsDigit(str[12]) || str[12] == 'X'))`. – AdrianHHH Dec 26 '14 at 20:02