0

I am using RegEx for finding URL substrings in strings. The RegEx I am using has been taken from tohster's answer on - What's the cleanest way to extract URLs from a string using Python?

The RE is -

r'^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$'

I have done some changes to it -

  1. In the IPv4 detection part, I changed the order of the IP range to be found. > Precisely, changed [1-9]\d?|1\d\d|2[01]\d|22[0-3] to 25[0-5]|2[0-4][0-9]|1[0-> 9]{2}|[1-9][0-9]|[0-9] at 2 instances.
  2. Made the https group - (?:https?|ftp):\/\/)?(?:\S+(?::\S*)?@) optional.

The final version is -

(?:(?:https?|ftp):\/\/)?(?:\S+(?::\S*)?@)?(?:((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/[^\s]*)?

The final RE I am using seems to be very promising and has improved significantly as per my requirements(as compared to the original one) and works in Python as well as Java Script, except for the fact that due to the changes I have done have caused the following examples to give "catastrophic backtracking" error -

asasasasasac31.23.53.122asasassasd

12312312312321.32.34.2312312312321

12.3423423432.234123123.123

31.134232131.231.34

Can be tested at - https://regex101.com/r/i6jDei/1

My contention is that the first example - asasasasasac31.23.53.122asasassasd should have some slick way to pass as the IP is surrounded by non-numeric chars.

Also, is there a way to pass the first two of the above examples as valid IPv4 addresses?

To resolve ambiguity, I would opt for the largest possible Address, i.e.,

31.23.53.122

21.32.34.231

Community
  • 1
  • 1
user1412066
  • 155
  • 3
  • 10

1 Answers1

2

The issue of the catastrophic backtracking is caused by the pattern (?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})) where (?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+) will jump through a lot of combinations, if the overall pattern can not be matched. As you can see the character classes are basically the same, so e.g. for asasasasasac31 it can match like:

(asasasasasac31)
(a)(sasasasasac31)
(a)(s)(asasasasac31)
(as)(asasasasac31)

This is not really the way it actually takes, just to show how many combinations exist.

The mistake here seems to be the - being optional which I see no reason for. If we remove the -, we get it working for your samples (and reduce the number of steps for the already working samples).

See the updated regex101-demo, where I also added your samples that caused the catastrophic backtracking.

The final pattern then is:

(?:(?:https?|ftp):\/\/)?(?:\S+(?::\S*)?@)?(?:((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])|(?:(?:[a-z\u00a1-\uffff0-9]+-)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/[^\s]*)?
Sebastian Proske
  • 8,255
  • 2
  • 28
  • 37
  • very well presented answer – msw Oct 15 '16 at 15:01
  • Thanks for the finding ! I have a lot of URLs with '-' characters. They are partially outputted and not the complete URL. See https://regex101.com/r/REu26j/1 . Is it a consequence of the change you made? What shall I do to get full URLs – user1412066 Oct 15 '16 at 19:43
  • 1
    @user1412066 if I input the first of your examples to your old pattern, the match is the same. For the second it shows catastrophic backtracking. – Sebastian Proske Oct 15 '16 at 19:51
  • 1
    It seems the double `-` is the issue, you might want to replace the two occurences of `-` in your pattern with `-+` – Sebastian Proske Oct 15 '16 at 19:53
  • Thanks ! I added [-_]+ at two places to get it working for "hyphens" and "underscores". Can we always find a way out of Cat. Backtracking? I think not. – user1412066 Oct 16 '16 at 11:22
  • 1
    @user1412066 Well this obviously depends on the input and what one actually wants to achieve, but I'd say that at least 99,9% of catastrophic backtracking can be solved. This might include a total rework of the used pattern. – Sebastian Proske Oct 16 '16 at 11:47