
Yes, another Regex question. You're welcome ;-P

This is the first time I've written my own regex for some simple string validation in C#. I think I've got it working but as a learning exercise I was wondering if it could be improved and whether I have made any mistakes.

The strings will all look something like this:

T20160307.0001

Rules:

  • Begin with the letter T.
  • Date in the format YYYYMMDD.
  • A full stop.
  • Last 4 characters are always numeric. There should be exactly 4.

Here is my regex (fiddle):

^(?i)[T]20[0-9]{2}[0-1][0-9][0-3][0-9].\d{4}$

  • ^ Assert the start of the string.
  • (?i)[T] Check that we have a letter T, case insensitive.
  • 20 YYYY begins with 20 (I'll be dead by 2100 so I don't care about anything further :-P)
  • [0-9]{2} Any number between 0 and 99 for second part of YYYY.
  • [0-1][0-9] 0 or 1 for first part of month, 0-9 for second part of month.
  • [0-3][0-9] 0-3 for first part of day, 0-9 for second part of day.
  • . Full stop.
  • \d{4} 4 numerical characters.
  • $ Assert end of string.

One pitfall I can already see is date validation. 20161935 (the 35th day of the 19th month) is considered valid. I've read some other posts about achieving this, which I believe match on number ranges, but I was unable to understand the format.
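A minimal sketch of that pitfall, assuming the pattern is run through `Regex.IsMatch`. The class name `OriginalPatternDemo` and the sample strings are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class OriginalPatternDemo
{
    // The pattern exactly as posted in the question.
    public const string Original = @"^(?i)[T]20[0-9]{2}[0-1][0-9][0-3][0-9].\d{4}$";

    public static bool Matches(string input)
    {
        return Regex.IsMatch(input, Original);
    }

    public static void Main()
    {
        // The intended input is accepted.
        Console.WriteLine(Matches("T20160307.0001")); // True

        // Month 19, day 35 is accepted too, because each digit is
        // range-checked on its own rather than as a two-digit number.
        Console.WriteLine(Matches("T20161935.0001")); // True
    }
}
```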

I would accept an answer that simply solved the date issue if someone would be kind enough to ELI5 how this works, but other improvements would be a welcome bonus.

Edit: To avoid further confusion I should state that I know about DateTime.TryParse etc. As mentioned, I'm using this as an opportunity to learn Regex and felt this was a good starting point. Sorry to anyone whose time I wasted; I should have made this clear in the original post.

Equalsk
  • You have a good start. The dot needs to be escaped: `\.`, or it will match any character. I would suggest the following improvements: `(?i)T` -> `[Tt]` (it's shorter, and I'm not sure if `(?i)` is allowed inline); use consistently either `[0-9]` or `\d`. I would suggest validating the date outside the regex, since leap year rules are complicated and your regex will get messy. – Heinzi Mar 07 '16 at 10:45
  • Why not use DateTime.TryParse and let it handle whether the date is valid? It's simpler, and if your format or requirements change it's easier to adjust. Also, how will you handle a leap year in your regex? – lordkain Mar 07 '16 at 10:51
  • Related, for validating the date: [Wanted: DateTime.TryNew(year, month, day) or DateTime.IsValidDate(year, month, day)](http://stackoverflow.com/q/9467967/87698) – Heinzi Mar 07 '16 at 10:55
  • @Heinzi, that one's overkill, all you need is `DateTime.TryParseExact` with the `yyyyMMdd` format on the captured substring – Lucas Trzesniewski Mar 07 '16 at 10:56

3 Answers

The things you can do are:

  • avoid the \d character class, which matches all Unicode digits (you only need the ASCII digits)
  • instead of [0-1] you can write [01]
  • escape the dot so that it matches a literal dot (and not any character)
  • no need to put T in a character class if it is the only character
  • optionally, you can remove the inline modifier and use [Tt] in place of T


^(?i)T20[0-9]{2}[01][0-9][0-3][0-9]\.[0-9]{4}$

or

^[Tt]20[0-9]{2}[01][0-9][0-3][0-9]\.[0-9]{4}$
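To see what escaping the dot changes, here is a small comparison sketch. The class name `PatternComparisonDemo` and the sample strings are hypothetical; the two patterns are the question's original and the `[Tt]` variant above:

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternComparisonDemo
{
    // Pattern as posted in the question (unescaped dot).
    public const string Original = @"^(?i)[T]20[0-9]{2}[0-1][0-9][0-3][0-9].\d{4}$";

    // Revised pattern from this answer (escaped dot, [Tt], ASCII digit classes).
    public const string Revised = @"^[Tt]20[0-9]{2}[01][0-9][0-3][0-9]\.[0-9]{4}$";

    public static void Main()
    {
        string[] samples = { "T20160307.0001", "t20160307.0001", "T20160307X0001" };

        foreach (string s in samples)
        {
            // Only the last sample differs: the unescaped dot accepts 'X'.
            Console.WriteLine("{0}: original={1}, revised={2}",
                s, Regex.IsMatch(s, Original), Regex.IsMatch(s, Revised));
        }
    }
}
```

Both patterns accept the upper-case and lower-case T; only the original accepts the string with an X in place of the full stop.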

Another thing: do you really need the extra range checks on the date digits, given that the pattern still can't tell whether the date actually exists? (Think for a minute about leap years.) So why not:

^(?i)T(20[0-9]{6})\.[0-9]{4}$

and if you want to know whether the date really exists, capture it and test it with the DateTime.TryParseExact method, using the yyyyMMdd format.
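A minimal sketch of that capture-then-parse idea, assuming the loose pattern above and `DateTime.TryParseExact` with the `yyyyMMdd` format. The class and method names (`CaptureAndParseDemo`, `IsValid`) are made up for illustration:

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class CaptureAndParseDemo
{
    // The regex only checks the overall shape; group 1 holds the 8 date digits.
    private static readonly Regex Shape = new Regex(@"^(?i)T(20[0-9]{6})\.[0-9]{4}$");

    public static bool IsValid(string input)
    {
        Match m = Shape.Match(input);
        if (!m.Success)
        {
            return false;
        }

        // Let the framework decide whether the date exists (leap years included).
        DateTime parsed;
        return DateTime.TryParseExact(m.Groups[1].Value, "yyyyMMdd",
            CultureInfo.InvariantCulture, DateTimeStyles.None, out parsed);
    }

    public static void Main()
    {
        Console.WriteLine(IsValid("T20160229.0001")); // True  (2016 is a leap year)
        Console.WriteLine(IsValid("T20150229.0001")); // False (2015 is not)
        Console.WriteLine(IsValid("T20161935.0001")); // False (no month 19)
    }
}
```

This keeps the regex short and leaves the calendar rules to the date parser.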

Casimir et Hippolyte
  • `\d` does only match ASCII if you enable ECMA: `var rex = new Regex(regex, RegexOptions.ECMAScript);` http://stackoverflow.com/a/16622773/360211 – weston Mar 07 '16 at 10:57
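A quick sketch of the difference that comment describes, using a string of Arabic-Indic digits as an example of non-ASCII Unicode digits. The class name `EcmaDigitDemo` and the test string are illustrative only:

```csharp
using System;
using System.Text.RegularExpressions;

public static class EcmaDigitDemo
{
    // Four Arabic-Indic digits (U+0661..U+0664): Unicode decimal digits, not ASCII.
    public const string ArabicIndic = "\u0661\u0662\u0663\u0664";

    public static bool MatchesDefault(string s)
    {
        return Regex.IsMatch(s, @"^\d{4}$");
    }

    public static bool MatchesEcma(string s)
    {
        return Regex.IsMatch(s, @"^\d{4}$", RegexOptions.ECMAScript);
    }

    public static void Main()
    {
        Console.WriteLine(MatchesDefault(ArabicIndic)); // True:  \d is any Unicode decimal digit
        Console.WriteLine(MatchesEcma(ArabicIndic));    // False: \d is restricted to 0-9
        Console.WriteLine(MatchesEcma("0001"));         // True
    }
}
```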

Why even use Regex? Just use the DateTime.TryParseExact method. I'd implement it like so, with extra checks for the other characters:

bool IsCorrectFormat(string input)
{
    //14 is a magic number for the length of the expected format
    if (input.Length == 14 && input.StartsWith("T", StringComparison.OrdinalIgnoreCase))
    {
        DateTime dt;
        if (DateTime.TryParseExact(input.Substring(1), "yyyyMMdd.ffff", CultureInfo.InvariantCulture, DateTimeStyles.None, out dt))
        {
            return true;
        }
    }

    return false;
}

I don't know if that format string is correct, but you could always take Substring(1, 8) to get the yyyyMMdd part, then check that the last 5 characters are a full stop followed by 4 digits.
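The substring approach just mentioned could look roughly like this. It is only a sketch: the class name `SubstringValidationDemo` is hypothetical, and it assumes the fixed 14-character layout from the question (T, 8 date digits, a full stop, 4 digits):

```csharp
using System;
using System.Globalization;

public static class SubstringValidationDemo
{
    public static bool IsCorrectFormat(string input)
    {
        // 'T' + 8 date digits + '.' + 4 digits = 14 characters.
        if (input == null || input.Length != 14)
        {
            return false;
        }

        if (!input.StartsWith("T", StringComparison.OrdinalIgnoreCase) || input[9] != '.')
        {
            return false;
        }

        // The last four characters must be ASCII digits.
        for (int i = 10; i < 14; i++)
        {
            if (input[i] < '0' || input[i] > '9')
            {
                return false;
            }
        }

        // Characters 1..8 are the date; the parser rejects dates that do not exist.
        DateTime date;
        return DateTime.TryParseExact(input.Substring(1, 8), "yyyyMMdd",
            CultureInfo.InvariantCulture, DateTimeStyles.None, out date);
    }

    public static void Main()
    {
        Console.WriteLine(IsCorrectFormat("T20160307.0001")); // True
        Console.WriteLine(IsCorrectFormat("T20161935.0001")); // False
    }
}
```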

EDIT: If this must be done with regex

I have used this regex in the past. Note that it expects the date in ddMMyyyy order (day first, year last) and does not check for leap years: 29 February is always rejected.

@"^(((0[1-9]{1}|[1-2][0-9]{1}|3[01]{1})(0[13578]{1}|1[02]{1}))" //For a 31 day month
+ @"|"
+ @"((0[1-9]{1}|[1-2][0-9]{1}|30)(0[469]{1}|11))" //For a 30 day month
+ @"|"
+ @"((0[1-9]{1}|1[0-9]{1}|2[0-8]{1})(02)))" //For a 28 day month(feb)
+ @"([0-9]{4})$"; //For the year
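A hedged usage sketch for this ddMMyyyy pattern. It drops the redundant `{1}` quantifiers and writes the 31-day months as `0[13578]|1[02]` so that October is covered and November is not. The class name `DayMonthYearDemo` is made up:

```csharp
using System;
using System.Text.RegularExpressions;

public static class DayMonthYearDemo
{
    private const string Pattern =
        @"^(((0[1-9]|[12][0-9]|3[01])(0[13578]|1[02]))"  // days 01-31, 31-day months
      + @"|((0[1-9]|[12][0-9]|30)(0[469]|11))"           // days 01-30, 30-day months
      + @"|((0[1-9]|1[0-9]|2[0-8])(02)))"                // days 01-28, February
      + @"([0-9]{4})$";                                  // four-digit year

    public static bool IsValid(string ddMMyyyy)
    {
        return Regex.IsMatch(ddMMyyyy, Pattern);
    }

    public static void Main()
    {
        Console.WriteLine(IsValid("07032016")); // True:  7 March 2016
        Console.WriteLine(IsValid("31102016")); // True:  31 October 2016
        Console.WriteLine(IsValid("31112016")); // False: November has 30 days
        Console.WriteLine(IsValid("29022016")); // False: 29 February is never accepted
    }
}
```

To apply it to the strings in the question, the date groups would need reordering to yyyyMMdd, with the T prefix and the `\.[0-9]{4}` suffix added around them.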
TheLethalCoder
  • The answers are all brilliant but I feel this one is the best fit for what I asked. Thanks all. – Equalsk Mar 08 '16 at 12:17

As mentioned I'm using this as an opportunity to learn Regex and felt this was a good starting point.

It's certainly not trivial to validate a date using a regular expression, particularly given the complex rules involved for leap years. But it is possible.

The expression below matches a valid date in yyyyMMdd format:

(?=\p{IsBasicLatin}{8}) # ensures \d matches only 0-9
(?!0000)\d{4} # year: any 4-digit year, except 0000
(?:0[1-9]|1[012]) # month 01-12
(?: 
   # day logic for leap years
   (?:
      (?!00)[012]\d # Days 01-29 (2/29 in non-leap years is excluded later)
      | (?<!02)30  # Day 30 valid for all months except Feb
      | (?<=0[13578]|1[02])31 # Day 31 valid for some months
   )
   # Non-Leap-year logic.  Do not allow 2/29 if the provided year
   # is not a leap year.
   (?<!
     (?:
        [13579] # years ending with odd number are not leap years
        | [02468][26]|[13579][048] # years not divisible by 4
                                     # are not leap years (02, 06, 10, ...)
        | (?:[02468][\d-[048]]|[13579][\d-[26]])00 # years divisible by
                                                 # 100 are not leap years,
                                                 # unless divisible by 400

     )0229)
)

Compile with RegexOptions.IgnorePatternWhitespace. You can use ^T~\.\d{4}$ to match the full string in your example, replacing ~ with the above expression.
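Putting the pieces together might look like the sketch below. It is an illustration, not a definitive implementation: the month alternative is written as `0[1-9]|1[012]` and the day lookahead as `(?!00)` so that each part consumes exactly two digits, the trailing counter uses `[0-9]{4}` instead of `\d{4}` to stay ASCII-only, and the class name `FullDateRegexDemo` is made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FullDateRegexDemo
{
    // Date part in yyyyMMdd order; the '#' comments need IgnorePatternWhitespace.
    private const string DatePattern = @"
        (?=\p{IsBasicLatin}{8})       # next 8 characters are ASCII, so \d means 0-9
        (?!0000)\d{4}                 # year: any 4 digits except 0000
        (?:0[1-9]|1[012])             # month: 01-12
        (?:
            (?!00)[012]\d             # days 01-29
          | (?<!02)30                 # day 30, except in February
          | (?<=0[13578]|1[02])31     # day 31, only in 31-day months
        )
        (?<!                          # reject 0229 when the year is not a leap year
            (?:
                [13579]                                     # year ends in an odd digit
              | [02468][26] | [13579][048]                  # last two digits not divisible by 4
              | (?:[02468][\d-[048]] | [13579][\d-[26]])00  # century not divisible by 400
            )
            0229
        )";

    private static readonly Regex Full = new Regex(
        @"^T" + DatePattern + @"\.[0-9]{4}$",
        RegexOptions.IgnorePatternWhitespace);

    public static bool IsValid(string input)
    {
        return Full.IsMatch(input);
    }

    public static void Main()
    {
        Console.WriteLine(IsValid("T20160307.0001")); // True
        Console.WriteLine(IsValid("T20160229.0001")); // True:  2016 is a leap year
        Console.WriteLine(IsValid("T20150229.0001")); // False: 2015 is not
        Console.WriteLine(IsValid("T19000229.0001")); // False: 1900 is not divisible by 400
        Console.WriteLine(IsValid("T20161935.0001")); // False: no month 19
    }
}
```

The character class subtraction (`[\d-[048]]`) and the variable-length lookbehinds are .NET regex features, so this pattern is not expected to port unchanged to other engines.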

drf