2

I encountered a strange error when trying this regex line in pcre:

^(.*[ \-_])?(SS|SSN|SIN|SSIN|SSNSIN|((SOC(IAL)?[_\- ]?SEC(URITY)?|SOC)[_\- ]?(DISABILITY)?[_\- ]?(INSURANCE)?(NUMBER|NUM|NO|NBR|NR)?))([ \-_].*)?(?<!(CD|DT|F))$

The error message is: Your expression caused an unhandled error: lookbehind assertion is not fixed length - offset: 158

I tried to fix it with this but it didn't work:

^(.*[ \-_])?(SS|SSN|SIN|SSIN|SSNSIN|((SOC(IAL)?[_\- ]?SEC(URITY)?|SOC)[_\- ]?(DISABILITY)?[_\- ]?(INSURANCE)?(NUMBER|NUM|NO|NBR|NR)?))([ \-_].*)?(?:(?<!(CD|DT))|(?<!F))$

Please help!

Xin Jin
  • I tried the solutions in https://stackoverflow.com/questions/3796436/whats-the-technical-reason-for-lookbehind-assertion-must-be-fixed-length-in-r but they didn't work. I guess their case uses ?< and mine uses ?<!. Any comments? – Xin Jin Feb 05 '18 at 17:44
  • It's not a *strange error*, it says it right in the error message. Your lookbehind needs to be **fixed** length (not variable length). What this means is that your lookbehind `(?<!(CD|DT|F))` needs to either be 1 or 2 characters, but it can't be both. Currently, `(?<!(CD|DT|F))` is either of length 2 (`CD`, `DT`) **or** length 1 (`F`). You can't do this in PCRE. The only exception to this rule is when you're using 0-length assertions such as `^`, `$`, etc. – ctwheels Feb 05 '18 at 17:44
  • Thanks for your comment. I understand the reason for this error, but I don't know how to fix it. It must be done in PCRE. Do you have any solutions? – Xin Jin Feb 05 '18 at 17:49
  • @XinJin can you provide us with a few sample strings? Doing so would allow us to provide the correct method, but you'll likely need to use a control verb like `(*FAIL)` with an if clause – ctwheels Feb 05 '18 at 17:50
  • sweaver2112, do you have any solution to fix it? – Xin Jin Feb 05 '18 at 17:51
  • @ctwheels, I tested it with 'SSN NUM_CD' on https://regex101.com/. I expected no match, because the trailing '_CD' should make the string fail the pattern, but it still showed a full match. In any case, I don't want to match anything that ends with '_CD', '_DT' or '_F' – Xin Jin Feb 05 '18 at 17:56
  • @XinJin: you only have to remove the capture group: `(?<!(CD|DT|F))` => `(?<!CD|DT|F)` – Casimir et Hippolyte Feb 05 '18 at 17:57
  • You can only do this with either the newer `regex` module by `Python` or with `.NET` in general. – Jan Feb 05 '18 at 18:02

3 Answers

6

Saying that a lookbehind must have a fixed length isn't entirely true with PCRE. Although you can't write things like (?<!ab*c) or (?<!(AB|BC|C)), you can write something like:

(?<!CD|DT|F)

A variable-length lookbehind is allowed only if it consists of a top-level alternation (not enclosed in a group) in which each branch has a fixed length.

In conclusion, the problem in your lookbehind is the group, not the differing lengths of the branches.

Casimir et Hippolyte
2

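Put your lookbehinds one after another (AND), not in alternation (OR): (?<!F)(?<!CD)(?<!DT). Your attempt ORs two negative assertions, so it succeeds whenever either one holds, which is why 'SSN NUM_CD' still matched. Chained, it looks like this: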

^(.*[ \-_])?(SS|SSN|SIN|SSIN|SSNSIN|((SOC(IAL)?[_\- ]?SEC(URITY)?|SOC)[_\- ]?(DISABILITY)?[_\- ]?(INSURANCE)?(NUMBER|NUM|NO|NBR|NR)?))([ \-_].*)?(?<!F)(?<!CD)(?<!DT)$

Since look-arounds are "zero-width assertions" that don't move the current match position to the right at all, you can simply put them one after the other.
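As a quick way to check the chained lookbehinds, here is a minimal sketch in Python. It assumes Python's `re` module rather than PCRE (the chained form is accepted by both), and `is_ssn_column` is a hypothetical helper name, not something from the question.

```python
import re

# The pattern from this answer, split across lines only for readability.
# The three negative lookbehinds sit just before the end-of-string anchor.
CHAINED = re.compile(
    r"^(.*[ \-_])?"
    r"(SS|SSN|SIN|SSIN|SSNSIN|"
    r"((SOC(IAL)?[_\- ]?SEC(URITY)?|SOC)[_\- ]?(DISABILITY)?[_\- ]?"
    r"(INSURANCE)?(NUMBER|NUM|NO|NBR|NR)?))"
    r"([ \-_].*)?"
    r"(?<!F)(?<!CD)(?<!DT)$"
)


def is_ssn_column(name: str) -> bool:
    """Hypothetical helper: True if the name matches and does not end in F, CD or DT."""
    return CHAINED.match(name) is not None


if __name__ == "__main__":
    for sample in ["SSN NUM", "SSN NUM_CD", "SSN_DT", "SSN_F", "SSN"]:
        print(sample, is_ssn_column(sample))
```

Note that the lookbehinds test the last one or two characters only, so any name ending in F, CD or DT is rejected, with or without a preceding separator.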

https://regex101.com/r/m95Jrs/1/

Scott Weaver
0

A lookbehind has to match a fixed-length string. regular-expressions.info contains this explanation:

The bad news is that most regex flavors do not allow you to use just any regex inside a lookbehind, because they cannot apply a regular expression backwards. The regular expression engine needs to be able to figure out how many characters to step back before checking the lookbehind. When evaluating the lookbehind, the regex engine determines the length of the regex inside the lookbehind, steps back that many characters in the subject string, and then applies the regex inside the lookbehind from left to right just as it would with a normal regex.

With the pattern (CD|DT|F) it can't do this, because it doesn't know whether to go back 2 characters or 1.
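The same kind of compile-time failure can be reproduced outside PCRE. The sketch below assumes Python's `re` module, whose rule is stricter than PCRE's (it rejects any lookbehind whose alternatives differ in length, grouped or not), so it illustrates the step-back problem rather than PCRE's exact behaviour. `compiles` is a hypothetical helper name.

```python
import re


def compiles(pattern: str) -> bool:
    """Hypothetical helper: report whether Python's re accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Raised at compile time, e.g. when a lookbehind is not fixed width.
        return False


if __name__ == "__main__":
    for pattern in [
        r"(?<!(CD|DT))",         # every branch is 2 characters
        r"(?<!(CD|DT|F))",       # branches of 2 and 1 characters
        r"(?<!F)(?<!CD)(?<!DT)", # three separate fixed-width lookbehinds
    ]:
        print(pattern, compiles(pattern))
```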

The workaround is to do your check in two steps. Take the negative lookbehind out of the regexp. If you get a match, do an extra check to see if one of those patterns is at the end, and remove it from your result set.
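A minimal sketch of that two-step check, assuming Python as the host language (the question does not say which one is used). The names `BASE`, `EXCLUDED_SUFFIXES` and `matches_two_step` are made up for illustration.

```python
import re

# Step 1: the original pattern with the negative lookbehind removed.
BASE = re.compile(
    r"^(.*[ \-_])?"
    r"(SS|SSN|SIN|SSIN|SSNSIN|"
    r"((SOC(IAL)?[_\- ]?SEC(URITY)?|SOC)[_\- ]?(DISABILITY)?[_\- ]?"
    r"(INSURANCE)?(NUMBER|NUM|NO|NBR|NR)?))"
    r"([ \-_].*)?$"
)

# Step 2: endings that disqualify an otherwise matching string.
EXCLUDED_SUFFIXES = ("CD", "DT", "F")


def matches_two_step(name: str) -> bool:
    """Hypothetical helper: regex match first, then a plain suffix check."""
    if BASE.match(name) is None:
        return False
    return not name.endswith(EXCLUDED_SUFFIXES)


if __name__ == "__main__":
    for sample in ["SSN NUM", "SSN NUM_CD", "SSN_F"]:
        print(sample, matches_two_step(sample))
```

Keeping the suffix list outside the regex also makes it easy to tighten later, for example to "_CD", "_DT" and "_F" if only separator-prefixed endings should be excluded.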

Barmar