The following is the error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/re.py", line 194, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.7/re.py", line 251, in _compile
    raise error, v # invalid expression
sre_constants.error: bad character range

This is my object:

>>> re101121=re.compile("""(?i)激[ _]{0,}活[ _]{0,}邮[ _]{0,}箱|(click|clicking)[ _]{1,}[here ]{0,1}to[ _]{1,}verify|stop[ _]{1,}mail[ _]{1,}.{1,16}[ _]{1,}here|(click|clicking|view|update)([ _-]{1,}|\\xc2\\xa0)(on|here|Validate)[^a-z0-9]{1}|(點|点)[ _]{0,}(擊|击)[ _]{0,}(這|这|以)[ _]{0,}(裡|里|下)|DHL[ _]{1,}international|DHL[ _]{1,}Customer[ _]{1,}Service|Online[ _]{1,}Banking|更[ _]{0,}新[ _]{0,}您[ _]{0,}的[ _]{0,}(帐|账)[ _]{0,}户|CONFIRM[ _]{1,}ACCOUNT[ _]{1,}NOW|avoid[ _]{1,}Account[ _]{1,}malfunction|confirm[ _]{1,}this[ _]{1,}request|verify your account IP|Continue to Account security|继[\\s-_]*续[\\s-_]*使[\\s-_]*用|崩[\\s-_]*溃[\\s-_]*信[\\s-_]*息|shipment[\\s]+confirmation|will be shutdown in [0-9]{0,} (hours|days)|DHL Account|保[ ]{0,}留[ ]{0,}密[ ]{0,}码|(Password|password|PASSWORD).*(expired|expiring)|login.*email.*password.*confirm|[0-9]{0,} messages were quarantined|由于.*错误(的)?(送货)?信息|confirm.*(same)? password|keep.*account secure|settings below|loss.*(email|messages)|simply login|quick verification now""")
ggorlen
XiaoTian
  • Welcome to SO! This code works for me in Python 2.7, which it appears you're using from your error (I took the liberty of tagging it to avoid confusion with 3). Can you show a [mcve]? Thanks. As an aside, `{0,}` could be simply `*` and always use raw strings with regex, like `r"... stuff ..."`. – ggorlen Apr 22 '21 at 03:01
  • When I deleted some rules so that it didn't look that long, I found that the error disappeared. I don't understand whether it was because the rules were too long or because there were some illegal expressions in the rules – XiaoTian Apr 22 '21 at 03:04
  • Probably the latter. Please show your full failing example or there's not much I can offer here. If the string is too large, you can binary search it to find the minimal failing pattern (or, better yet, please do that anyway even if it's not that large, so the problem is isolated). BTW, I used `# -*- coding: utf-8 -*-` when I tried to reproduce this. – ggorlen Apr 22 '21 at 03:05
  • OK, thanks for your comment, and here is my full example: – XiaoTian Apr 22 '21 at 03:09
  • Sorry, the example is too large to be added to comment area, I have put it in the question – XiaoTian Apr 22 '21 at 03:13
  • This is my company's rule set, and I need to use it to find out which fields some files match. So... – XiaoTian Apr 22 '21 at 03:18

1 Answer

After minimization, your error boils down to `re.compile("""[\\s-_]""")`. This is indeed a bad character range: the dash between `\s` and `_` is parsed as a range operator, and a shorthand class like `\s` cannot be a range endpoint. You probably meant the dash to be literal: `re.compile(r"[\s\-_]")` (always use raw strings for regex, `r"..."`). Moving the dash to the end of the bracket group works too: `r"[\s_-]"`.
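
As a quick check, here is a minimal sketch that reproduces the failure and confirms that both fixes compile and match the same characters. The `sample` string and the variable names are made up for illustration.

```python
import re

# The minimized failing pattern: "\s" cannot be the start of a range,
# so the dash makes the character class invalid.
try:
    re.compile(r"[\s-_]")
    failed = False
except re.error:
    failed = True

# Two fixes: escape the dash, or move it to the end of the class.
escaped = re.compile(r"[\s\-_]")
trailing = re.compile(r"[\s_-]")

# Illustrative input containing a space, a dash and an underscore.
sample = "a -_b"
escaped_hits = escaped.findall(sample)
trailing_hits = trailing.findall(sample)
```

Both compiled patterns should treat whitespace, `-` and `_` as separators, which is presumably what the original rule intended.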

In the future, try to binary search to find the minimal failing input: remove the right half of the regex. If it still fails, the problem must have been in the left half. Remove the right half of the remaining substring and repeat until you're down to a minimal failing case. This technique doesn't always work when the problem spans both halves, but it can't hurt to try.
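
The bisection can also be automated. The sketch below is a variation on the idea rather than a literal implementation of it: instead of cutting the string at an arbitrary midpoint (which can split a group in two and introduce unrelated errors), it splits the pattern at top-level `|` and bisects over the resulting branches. The helpers `compiles`, `split_top_level` and `find_failing_branch` are hypothetical names, the splitter ignores edge cases such as a `]` placed first inside a character class, and the search assumes that at least one branch fails to compile on its own.

```python
import re

def compiles(pattern):
    """Return True if the pattern compiles, False on re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

def split_top_level(pattern):
    """Split on '|' that is outside (...) and [...] (simplified parser)."""
    parts, current = [], []
    depth, in_class, i = 0, False, 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Keep an escape sequence together so "\|" or "\(" is not misread.
            current.append(pattern[i:i + 2])
            i += 2
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
            continue
        current.append(ch)
        i += 1
    parts.append("".join(current))
    return parts

def find_failing_branch(pattern):
    """Bisect the top-level alternatives down to one that fails to compile."""
    branches = split_top_level(pattern)
    while len(branches) > 1:
        mid = len(branches) // 2
        left = branches[:mid]
        # If the left half fails, the culprit is there; otherwise look right.
        branches = left if not compiles("|".join(left)) else branches[mid:]
    return branches[0]

# Illustrative pattern (ASCII stand-in for the rule set in the question).
pattern = r"(?i)(click|clicking)[ _]+here|continue[\s-_]*using|DHL[ _]+international|simply login"
culprit = find_failing_branch(pattern)
```

On the illustrative pattern, `culprit` ends up as the branch containing `[\s-_]`, which can then be fixed in isolation.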

As mentioned in the comments, it's pretty odd to have such a massive regex as this, but I'll assume you know what you're doing.

As another aside, there are some antipatterns in this regex (pardon the pun) like `{0,}` which can be simplified to `*`.
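
To illustrate the simplification, the sketch below compares a fragment written in the question's style with the shorter equivalent, using `+` for `{1,}` and `*` for `{0,}`. The fragment and the sample strings are made up for illustration.

```python
import re

# Same rule written two ways: explicit repetition counts vs. shorthand.
verbose = re.compile(r"stop[ _]{1,}mail[ _]{0,}here")
concise = re.compile(r"stop[ _]+mail[ _]*here")

# Illustrative inputs: the last one lacks the required separator after "stop".
samples = ["stop mail here", "stop__mailhere", "stopmail here"]
verbose_results = [bool(verbose.search(s)) for s in samples]
concise_results = [bool(concise.search(s)) for s in samples]
```

The two lists should be identical, since the quantifiers are defined to mean the same thing; the shorthand is just easier to read in a long rule.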

ggorlen