
YARP. (Yup, another regex problem).

I'm not sure of a clearer way to describe this than with concrete examples.

Sample text:

  1. 4444 4444 4444 4444
  2. 4444444444444444
  3. 44 44 44 44 44 44 44 44
  4. 4444-4444-4444-4444
  5. 4444 (multiple spaces) 4444 (multiple spaces) 4444 (multiple spaces) 4444
  6. 0.4444444444444444
  7. 0.4444 4444 4444 4444

I need to build a regex that will match 1, 2 and 4 only. Requirements: 13-16 digits; dashes and spaces are optional delimiters, but only as single characters (no runs of spaces), and no more than 3 delimiters in total.

This is obviously related to searching for CC info. I've done a ton of research and found many examples that match most, all or none of the samples, but nothing that eliminates excessive false positives like 3 and 5 above. I'm using PowerGREP 5 and I've read the entire tutorial on https://www.regular-expressions.info/tutorial.html, but I cannot figure out how to limit the number of optional whitespace characters in the overall match. I.e., "1 2 3 4 5 6 7 8 9" matches just as well as "123 456 789" if I make the space(s) optional. Essentially, I want the regex to abandon the match if more than 3 spaces/dashes are detected.

Side note: I work for a company that deals with a TON of calendar data, so grepping a huge drive with many "1 2 3 4 5 6 7 8 ..." style text strings is generating a ton of false hits, even if I take time to tailor searches to CC inclusive patterns.

Any help would be super appreciated.

The closest I've found is:

\b(?:\d[ -]*?){13,16}\b

Which grabs any 13-16 digits (allowing for a dash or space in between) as expected, but it will also match "1 2 3 4 5 6 7 8 9 10 11" which is obviously not helpful.
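To make the false positive concrete, here is a minimal sketch that runs the pattern above through Python's `re` module. Python is an assumption here (the question uses PowerGREP, whose engine may differ in details), and the name `LOOSE` is only illustrative.

```python
import re

# The pattern from the question: every digit may be followed by any run of
# spaces or dashes, so the delimiters are never counted.
LOOSE = re.compile(r"\b(?:\d[ -]*?){13,16}\b")

samples = [
    "4444 4444 4444 4444",      # a plausible card layout
    "1 2 3 4 5 6 7 8 9 10 11",  # calendar-style noise with 13 digits in total
]

for s in samples:
    m = LOOSE.search(s)
    print(repr(s), "->", repr(m.group(0)) if m else None)
```

Both strings match in full, because the only thing the pattern constrains is the total number of digits.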

Here is an all-inclusive, brand-specific CC regex. It fails to find valid numbers if they contain spaces/dashes (but will find UK telephone numbers, heh):

\b(?:4[0-9]{12}(?:[0-9]{3})?|(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)[0-9]{12}|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11}|6(?:011|5[0-9]{2})[0-9]{12}|(?:2131|1800|35\d{3})\d{11})\b

So then I tried replacing any [0-9] character class instances above with (?:\d[ -]*?) and that will find valid CCs with dashes/spaces, but it also matches all the "1 2 3 4 5 6 7 8 9 10 11" type false positives.

I am very new to regex, so if I'm committing a huge noob error, please feel free to point me in the right direction. Thank you!

Edit:

Replacing [0-9] with (?:\d[ -]?) for just the longer consecutive-digit parts seems to be pretty close to what I need. I grepped the same drive as before and got only 311 matches, including all 3 positive files. I can live with just 308 false matches, but I've got to imagine there's still a better way to do this. It's also still matching strings of 13-16 digits with more than 3 delimiters...

Current regex:

\b(?:4(?:\d[ -]?){12}(?:[0-9]{3})?|(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)(?:\d[ -]?){12}|3[47](?:\d[ -]?){13}|3(?:0[0-5]|[68][0-9])(?:\d[ -]?){11}|6(?:011|5[0-9]{2})(?:\d[ -]?){12}|(?:2131|1800|35\d{3})(?:\d[ -]?){11})\b

1 Answer


Since it looks like you want every fourth digit to be followed by either a dash, a single space, or nothing, the simplest way would be to use

^(\d{4}[\s\-]?){3}\d{4}$

This would meet your written criteria, but it would also allow a mixture of delimiters like: 1234-5678 9012. If that's not acceptable, you can use a positive lookahead to validate that the same delimiter is repeated:

^(?=(\d{4}){3}|(\d{4}-){3}|(\d{4}\s){3})(\d{4}[\s-]?){3}\d{4}$

The first regex

  • Starts at the beginning of the string: ^
  • Finds four digits (0-9), optionally followed by space or dash, and repeats this pattern 3 times: (\d{4}[\s\-]?){3}
  • Then is followed by four more digits and the end of the string: \d{4}$
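As a quick check of the breakdown above, this sketch runs the first regex against the question's seven samples plus one mixed-delimiter string. It assumes Python's `re` module and that each sample is tested as a whole string; the names `FIRST`, `samples` and `matched` are illustrative only, and sample 5's "(multiple spaces)" is written out as double spaces.

```python
import re

# First regex from the answer: four groups of four digits, each of the first
# three optionally followed by one whitespace character or dash.
FIRST = re.compile(r"^(\d{4}[\s\-]?){3}\d{4}$")

samples = [
    "4444 4444 4444 4444",      # 1: single spaces
    "4444444444444444",         # 2: no delimiters
    "44 44 44 44 44 44 44 44",  # 3: too many delimiters
    "4444-4444-4444-4444",      # 4: dashes
    "4444  4444  4444  4444",   # 5: double spaces
    "0.4444444444444444",       # 6: decimal fraction
    "0.4444 4444 4444 4444",    # 7: decimal fraction with spaces
    "1234-5678 9012 3456",      # extra: mixed delimiters
]

matched = [s for s in samples if FIRST.search(s)]
for s in matched:
    print(s)
```

Samples 1, 2 and 4 match, and so does the mixed-delimiter string, which is the case the lookahead version is meant to exclude.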

Taking just the lookahead from the second regex: (?=(\d{4}){3}|(\d{4}-){3}|(\d{4}\s){3})

  • Before the main pattern consumes anything, we again start at the beginning of the string, look at the first three repeated groups, and ensure that the delimiter between them is the same.
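The effect of the lookahead can be sketched the same way. This again assumes Python's `re` module; `SAME_DELIM` and `is_match` are hypothetical names introduced for the example.

```python
import re

# Second regex from the answer: the lookahead requires the first three groups
# to use the same delimiter (none, dash, or whitespace) before the main
# pattern is allowed to match.
SAME_DELIM = re.compile(
    r"^(?=(\d{4}){3}|(\d{4}-){3}|(\d{4}\s){3})(\d{4}[\s-]?){3}\d{4}$"
)

def is_match(s: str) -> bool:
    """Return True if the whole string fits the same-delimiter pattern."""
    return SAME_DELIM.search(s) is not None

for s in ["4444 4444 4444 4444", "4444-4444-4444-4444",
          "4444444444444444", "1234-5678 9012 3456"]:
    print(repr(s), is_match(s))
```

The first three strings pass and the mixed one is rejected. One caveat: the lookahead only inspects the first three groups, so a string such as "444444444444-4444" (no delimiters, then a single dash before the last group) still passes.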

I see that in your example regex you want to allow 13-16 digits, and mine was specifically for 16. For 13-16 digits, you need to determine where you want those delimiters to be. Can they be anywhere, as long as there are only three of them and they don't repeat? I also see that you're using word boundaries, so I'm guessing that you're trying to match substrings. You can do that, but it'll be a little more difficult. Dashes and spaces are non-word characters, so \b matches between them and a digit, and you might get some false positives without some lookarounds.

As far as integrating this into your CC regex: [ -]*? lazily matches an unbounded number of dashes or spaces; you just want ? instead of *?. If you need more flexibility in where those spaces/numbers go while still limiting them, then I'd probably use a negative regex (such as a negative lookahead) to validate.
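One possible reading of that last suggestion is a negative lookahead that counts delimiters instead of fixing their positions. The sketch below is an assumption-laden illustration, not a tested PowerGREP pattern: it uses Python's `re` module, the name `CARD_LIKE` is made up, and it only checks the 13-16 digit shape, not the brand prefixes from the question.

```python
import re

CARD_LIKE = re.compile(
    r"(?<![\d.])(?<!\d[ -])"  # not mid-number, after a decimal point, or after "digit + delimiter"
    r"(?!(?:\d+[ -]){4}\d)"   # negative lookahead: give up if a 4th delimiter is followed by a digit
    r"\d(?:[ -]?\d){12,15}"   # 13-16 digits, at most one space/dash between neighbours
    r"(?![ -]?\d)"            # the digit run must end here
)

samples = [
    "4444 4444 4444 4444",      # 1
    "4444444444444444",         # 2
    "44 44 44 44 44 44 44 44",  # 3
    "4444-4444-4444-4444",      # 4
    "4444  4444  4444  4444",   # 5 (double spaces)
    "0.4444444444444444",       # 6
    "0.4444 4444 4444 4444",    # 7
    "1 2 3 4 5 6 7 8 9 10 11",  # calendar-style noise
]

matched = [s for s in samples if CARD_LIKE.search(s)]
for s in matched:
    print(s)
```

With these samples only 1, 2 and 4 are reported. The lookbehinds take the place of `\b` so that a decimal fraction or the tail of a longer digit sequence is not picked up as a separate match.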

Gary
  • A few notes: not every 4th digit should be followed by a space or dash. Certain older CC numbers have 13 digits, which obviously breaks the every-4th-digit rule. Every CC number does have at most 3 delimiters, which is why I'm looking for a way to count the delimiters rather than check their positions (if any), and use that count to remove 95%+ of the false positives. Trying the regex above, it's matching numbers like 0001 by itself, which is not 13-16 digits. I'm not sure how to integrate this with a useful regex, but the idea of lookahead / if-then nesting seems promising. – Kenneth Barker Sep 24 '18 at 12:45