I am using RegEx for finding URL substrings in strings. The RegEx I am using has been taken from tohster's answer on - What's the cleanest way to extract URLs from a string using Python?
The RE is -
r'^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$'
I have done some changes to it -
- In the IPv4 detection part, I changed the order of the IP range to be found. > Precisely, changed
[1-9]\d?|1\d\d|2[01]\d|22[0-3]
to25[0-5]|2[0-4][0-9]|1[0-> 9]{2}|[1-9][0-9]|[0-9]
at 2 instances.- Made the https group -
(?:https?|ftp):\/\/)?(?:\S+(?::\S*)?@)
optional.
The final version is -
(?:(?:https?|ftp):\/\/)?(?:\S+(?::\S*)?@)?(?:((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/[^\s]*)?
The final RE I am using seems to be very promising and has improved significantly as per my requirements(as compared to the original one) and works in Python as well as Java Script, except for the fact that due to the changes I have done have caused the following examples to give "catastrophic backtracking"
error -
asasasasasac31.23.53.122asasassasd
12312312312321.32.34.2312312312321
12.3423423432.234123123.123
31.134232131.231.34
Can be tested at - https://regex101.com/r/i6jDei/1
My contention is that the first example - asasasasasac31.23.53.122asasassasd
should have some slick way to pass as the IP is surrounded by non-numeric chars.
Also, is there a way to pass the first two of the above examples as valid IPv4 addresses?
To resolve ambiguity, I would opt for the largest possible Address, i.e.,
31.23.53.122
21.32.34.231