I have to construct a regex that matches client codes that look like:

  • XXX/X{3,6}
  • XXX.X{3,6}
  • XXX.X{3,6}/XXX

With X a number between 0 and 9.

The regex needs to be strict enough that we don't extract codes embedded in another string. My first idea was to use word boundaries. The regex looks like this: \b\d{3}[\.\/]\d{3,6}(?:\/\d{3})?\b

The problem with word boundaries is that a dot also counts as a boundary, so a number like "123/456.12" would yield "123/456" as the client code. I then came up with the following regex: (?<!\S)\d{3}[\.\/]\d{3,6}(?:\/\d{3})?(?!\S). It uses a lookbehind and a lookahead to require that the characters on either side, if any, are whitespace. This matches most of the client codes correctly.
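To make the difference concrete, here is a minimal sketch comparing the two patterns above. The sample string and the constant names are made up for illustration only.

```python
import re

# Word-boundary version from the question (escapes inside [] dropped, same meaning).
WORD_BOUNDARY = re.compile(r'\b\d{3}[./]\d{3,6}(?:/\d{3})?\b')
# Whitespace-lookaround version from the question.
WHITESPACE_BOUNDARY = re.compile(r'(?<!\S)\d{3}[./]\d{3,6}(?:/\d{3})?(?!\S)')

# Hypothetical input: the first number is not a client code, the second is.
sample = "amount 123/456.12 client 123/4567"

# \b sees a boundary between "6" and ".", so "123/456" is wrongly extracted.
print(WORD_BOUNDARY.findall(sample))        # ['123/456', '123/4567']
# The lookarounds reject "123/456" because a non-whitespace "." follows it.
print(WHITESPACE_BOUNDARY.findall(sample))  # ['123/4567']
```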

But there is still one last issue. We extract the codes from Google OCR text, so a valid code can appear in the text as 123/456\n, \n123/456, \n123/456\n, etc. Checking whether the previous and/or next character is whitespace doesn't work, because the literal "\n" (a backslash followed by the letter n) is not whitespace. If I use something like (?<!\S|\\n) as the boundary, it also includes a back and/or forward slash for some reason. Currently I have the following regex: (?<![^\r\n\t\f\v n])\d{3}[\.\/]\d{3,6}(?:\/\d{3})?(?![^\r\n\t\f\v \\]), but that only checks whether the previous character is an "n" or whitespace and the next one a backslash or whitespace. So strings like "lorem\123/456" would still produce a match. I need some way to treat "\n" as whitespace without breaking the lookahead/lookbehind.
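A short sketch of the failure described above, assuming the OCR text really contains the two characters \ and n rather than an LF. The sample strings are invented for illustration.

```python
import re

WHITESPACE_BOUNDARY = re.compile(r'(?<!\S)\d{3}[./]\d{3,6}(?:/\d{3})?(?!\S)')

real_newlines = "total\n123/456\nnext"      # actual LF characters around the code
literal_newlines = r"total\n123/456\nnext"  # backslash + letter n around the code

# A real LF is whitespace, so the code is found.
print(WHITESPACE_BOUNDARY.findall(real_newlines))     # ['123/456']
# With a literal \n, the "n" before and the "\" after are non-whitespace: no match.
print(WHITESPACE_BOUNDARY.findall(literal_newlines))  # []
```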

Do you guys have any idea how to solve this issue? All input is appreciated. Thx!

Ian
  • So for one thing, I recommend casting the string to raw string such as https://stackoverflow.com/questions/4415259/convert-regular-python-string-to-raw-string – FloLie May 31 '21 at 10:45
  • We can't do that because we have labeled data that uses the indexes of the string as a Span object. So if we cast to a raw string the labeled data won't be valid anymore. Refactoring the labeled data is no option either. – Ian May 31 '21 at 10:49
  • How does ```^(\\n){0,1}[0-9]{3}\/[0-9]{3,6}(\\n){0,1}$``` work? So the line may or may not contain a newline character at the beginning and/or end, but definitely nothing else before or after it? – FloLie May 31 '21 at 10:52
  • I think this should work: ```^(\\n){0,1}[0-9]{3}\/[0-9]{3,6}(\/[0-9]{3}){0,1}(\\n){0,1}$``` – FloLie May 31 '21 at 10:59

1 Answer

It seems you want to subtract \n from the whitespace boundaries. You can use

re.findall(r'(?<![^\s\n])\d{3}[./]\d{3,6}(?:/\d{3})?(?![^\s\n])', text)

See the Python demo and this regex demo.

If the \n in your text are two-character combinations of \ and n rather than real line feeds, you need to make sure the \S in the lookarounds does not match those:

import re
text = r'Codes like 123/456\n \n123/3456 \n123/23456\n etc are correct \n333.3333/333\n'
print( re.findall(r'(?<!\S(?<!\\n))\d{3}[./]\d{3,6}(?:/\d{3})?(?!(?!\\n)\S)', text) )
# => ['123/456', '123/3456', '123/23456', '333.3333/333']

See this Python demo.

Details:

  • (?<![^\s\n]) - a negative lookbehind that matches a location that is not immediately preceded with a char other than whitespace and an LF char
  • (?<!\S(?<!\\n)) - a left whitespace boundary that does not trigger if the non-whitespace is the n from the \n char combination
  • \d{3} - three digits
  • [./] - a . or /
  • \d{3,6} - three to six digits
  • (?:/\d{3})? - an optional sequence of / and three digits
  • (?![^\s\n]) - a negative lookahead that requires no char other than whitespace and LF immediately to the right of the current location.
  • (?!(?!\\n)\S) - a right whitespace boundary that does not trigger if the non-whitespace is the \ char followed with n.
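Since the comments mention that the labeled data relies on string indexes, here is a minimal sketch of the same literal-\n pattern written with re.VERBOSE and used with finditer to keep the offsets. The names CLIENT_CODE and find_client_codes and the sample text are hypothetical, not something from the question.

```python
import re

# Same pattern as above, split over lines with re.VERBOSE for readability.
CLIENT_CODE = re.compile(r'''
    (?<!\S(?<!\\n))    # left boundary: reject a preceding non-whitespace char,
                       # except the "n" that ends a literal backslash-n
    \d{3}[./]\d{3,6}   # three digits, "." or "/", then three to six digits
    (?:/\d{3})?        # optional "/" and three digits
    (?!(?!\\n)\S)      # right boundary: reject a following non-whitespace char,
                       # except the "\" that starts a literal backslash-n
''', re.VERBOSE)


def find_client_codes(text):
    """Return (code, start, end) tuples; offsets index into the original text."""
    return [(m.group(), m.start(), m.end()) for m in CLIENT_CODE.finditer(text)]


# Hypothetical OCR text: "lorem\123/456" and "123/456.12" must not match.
text = r'lorem\123/456 \n123/456\n 123/456.12 \n333.3333/333\n'
for code, start, end in find_client_codes(text):
    print(code, start, end)
```

Because the text itself is never modified, the reported offsets should stay compatible with existing span annotations.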
Wiktor Stribiżew
  • Thanks, but this still won't match when the string looks like this: "\n333.3333/333\n". Remember the "\n" is literal text here. – Ian May 31 '21 at 10:52
  • Using a lookahead/behind within the other to include the "\n" seems to work! Thx – Ian May 31 '21 at 11:01
  • @Ian Yes, because they are used to *restrict* the `\S` pattern inside the outer lookarounds. The outer lookarounds restrict the `\d{3}[./]\d{3,6}(?:/\d{3})?` matching context. – Wiktor Stribiżew May 31 '21 at 11:02