
I am trying to use Diego Perini's URL regex in Python to extract links, but apparently I have to change the regex syntax first, because as written it doesn't return any results.

There are two versions on gist:

JavaScript version

 /^(?:(?:https?|ftp):\/\/)(?:\S+(?::\S*)?@)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/[^\s]*)?$/i

PHP version

_^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$_iuS

Is either of them ready to use in Python? Neither one works as is: neither returns the links from my text.

Sfinos
  • The only problem looks to be whether Python supports the `\u....` Unicode escape construct, something else, or nothing at all. Perl and PCRE use `\x{}`; Java and .NET support `\u....` –  Apr 03 '14 at 15:21
  • Also, these regexes _require_ `(?:https?|ftp)` at the very beginning of the string (and the URL to run to its end), or they fail right away. –  Apr 03 '14 at 15:59

1 Answer


One way is to study the 544 pages of the book Mastering Regular Expressions and work out the differences between regex flavors by hand (definitely a good idea if you have the time and the passion).

Another option is to use the RegexBuddy tool, which can automatically convert regular expressions between regex flavors:

You can convert from any regex flavor supported by RegexBuddy to any other regex flavor. RegexBuddy automatically substitutes equivalent syntax.
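For this particular pattern, a manual port is also fairly small. Below is a minimal sketch, assuming Python 3.3 or later (where `re` understands `\uXXXX` escapes inside a pattern). It takes the JavaScript version, drops the `/.../i` delimiters in favor of `re.IGNORECASE`, writes `\/` as `/`, and removes the `^` and `$` anchors so the pattern can be searched for inside longer text. The names `URL_BODY`, `URL_RE`, `find_urls` and `is_url` are made up for this illustration and are not part of the gist.

```python
import re

# Body of the JavaScript version without delimiters and without ^...$.
# Split across lines only for readability; the pieces concatenate to
# the original expression.
URL_BODY = (
    r"(?:(?:https?|ftp)://)"                                  # scheme
    r"(?:\S+(?::\S*)?@)?"                                     # optional user:pass@
    r"(?:"
    r"(?!(?:10|127)(?:\.\d{1,3}){3})"                         # exclude private and
    r"(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})"              # local address ranges
    r"(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})"
    r"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"                     # dotted IPv4 address
    r"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}"
    r"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"
    r"|"
    r"(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)"        # host name
    r"(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*"     # sub-domains
    r"(?:\.(?:[a-z\u00a1-\uffff]{2,}))"                               # TLD
    r")"
    r"(?::\d{2,5})?"                                          # optional port
    r"(?:/[^\s]*)?"                                           # optional path
)

# re.IGNORECASE stands in for the trailing /i of the JavaScript literal.
URL_RE = re.compile(URL_BODY, re.IGNORECASE)


def find_urls(text):
    """Return every URL-like substring found in text (extraction)."""
    # All groups are non-capturing, so findall returns the full matches.
    return URL_RE.findall(text)


def is_url(candidate):
    """Return True if the whole string is a URL (validation)."""
    # fullmatch plays the role of the original ^...$ anchors.
    return URL_RE.fullmatch(candidate) is not None


if __name__ == "__main__":
    sample = "See http://example.com/docs and ftp://files.example.org today."
    print(find_urls(sample))
    print(is_url("http://example.com"), is_url(sample))
```

Note that the path part is `[^\s]*`, so when extracting from prose any punctuation directly after a URL (a trailing full stop, a closing bracket) ends up inside the match and may need stripping afterwards. The nested quantifiers in the host-name part can also backtrack heavily on long inputs that almost match, so it is worth testing on realistic data before relying on it.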


alecxe