
Here is my current regex: (?:ht|f)tps?:[\S]*\/?(?:\w+)

I need to refine it such that it pulls the following link correctly from the quoted text below: http://www.purdue.edu/transcom/index.php

Any thoughts on how I can improve my current regex? Thanks in advance!

Additional information about the experimental protocol and results is provided in the companion files and the TransCom project web site (http://www.purdue.edu/transcom/index.php).The results of the Level 1 experiments presented here are grouped into two broad categories

Alfabravo
  • I suppose you want to find any link, not just this specific one, right? – Nov 06 '18 at 15:30
  • Try this: [`\(((?:ht|f)tps?:[\S]*\/?(?:\w+))\)` and retrieve group 1](https://regex101.com/r/nSNurX/1). – r.ook Nov 06 '18 at 15:31
  • @eukaryota correct, find any link embedded in any text :) – coding_patty Nov 06 '18 at 15:39
  • @Idlehands unfortunately that pulls with the parentheses. i'll be doing a response code test on the link, so i would need the entire link without the parentheses. – coding_patty Nov 06 '18 at 15:42
  • As you can see from StackOverflow's failed attempt to parse your URL, doing so isn't easy... – Aaron Nov 06 '18 at 15:42
  • @coding_patty that's why I said retrieve group 1... it's matching the link within the parentheses and then returning the result *without* the parentheses. – r.ook Nov 06 '18 at 15:43
  • The first answer at the duplicate is able to properly parse your example. – Mark Ransom Nov 06 '18 at 16:29
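The capture-group suggestion from the comments can be tried directly in Python. This is a minimal sketch that uses the quoted passage from the question as input; the variable names (`text`, `original`, `wrapped`, `overmatch`, `url`) are illustrative only. It also shows what the original pattern returns on its own: the greedy `[\S]*` runs to the next whitespace, so the match drags in the trailing `).The`.

```python
import re

text = (
    "Additional information about the experimental protocol and results is "
    "provided in the companion files and the TransCom project web site "
    "(http://www.purdue.edu/transcom/index.php).The results of the Level 1 "
    "experiments presented here are grouped into two broad categories"
)

# Pattern from the question: [\S]* is greedy and only stops at whitespace.
original = re.compile(r'(?:ht|f)tps?:[\S]*\/?(?:\w+)')
overmatch = original.search(text).group(0)

# Pattern from the comment: literal parentheses around a capturing group.
wrapped = re.compile(r'\(((?:ht|f)tps?:[\S]*\/?(?:\w+))\)')
url = wrapped.search(text).group(1)  # group 1 excludes the parentheses

print(overmatch)  # the link plus the trailing ").The"
print(url)        # the link alone
```

Note that this variant only finds links that are wrapped in parentheses, so it would likely miss a bare URL elsewhere in the text.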

2 Answers

0

I have not tested your regex thoroughly, and it is not clear why your current regex is failing. But to catch a URL in general, I would repeat a group made of a slash followed by the characters allowed in a URL, minus the slash (something like [a-zA-Z0-9.]), something like

r'(?:ht|f)tps?:/(?:/[a-zA-Z0-9.]+)*'

and possibly a positive lookahead assertion if the URL is always inside quotes or parentheses...
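As a rough sketch of this idea in Python: the first pattern repeats a group of one slash followed by allowed characters, and the second adds a positive lookahead for a closing parenthesis or quote. The character class `[a-zA-Z0-9.]` is only the illustrative one mentioned above and is an assumption; real URLs also contain hyphens, query strings, and other characters, so it would need widening for general use. The names `URL_RE`, `URL_IN_DELIMS_RE`, and `text` are hypothetical.

```python
import re

text = (
    "the TransCom project web site "
    "(http://www.purdue.edu/transcom/index.php).The results of the Level 1 "
    "experiments presented here"
)

# Scheme, one slash, then any number of "/segment" groups.
# Assumption: segments contain only letters, digits, and dots.
URL_RE = re.compile(r'(?:ht|f)tps?:/(?:/[a-zA-Z0-9.]+)*')

# Same pattern, but the match must be followed by ), " or '.
URL_IN_DELIMS_RE = re.compile(r'(?:ht|f)tps?:/(?:/[a-zA-Z0-9.]+)*(?=[)"\'])')

urls = URL_RE.findall(text)
delimited = URL_IN_DELIMS_RE.findall(text)

print(urls)
print(delimited)
```

Because the closing parenthesis is not in the character class, the match stops right before it, and no capture group is needed to strip it.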

0

Url Similar Splitter

Matches URL-like strings and splits each one into its address and parameters

by deme72

([--:\w?@%&+~#=]*\.[a-z]{2,4}\/{0,2})((?:[?&](?:\w+)=(?:\w+))+|[--:\w?@%&+~#=]+)?

Source: regexr.com community
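To see how this pattern behaves, here is a small Python sketch. The names `SPLITTER`, `text`, `m`, and `m2`, and the second URL (`http://www.example.com/search?q=regex&lang=en`), are made up for illustration and are not part of the regexr entry. Group 1 ends at the last dot that is followed by 2 to 4 lowercase letters, so for the link in the question the whole URL lands in group 1 and group 2 stays empty.

```python
import re

SPLITTER = re.compile(
    r'([--:\w?@%&+~#=]*\.[a-z]{2,4}\/{0,2})'
    r'((?:[?&](?:\w+)=(?:\w+))+|[--:\w?@%&+~#=]+)?'
)

text = (
    "the TransCom project web site "
    "(http://www.purdue.edu/transcom/index.php).The results"
)

# The parentheses are not in the character class, so they are left out.
m = SPLITTER.search(text)
print(m.group(1))  # address part
print(m.group(2))  # parameter part, None when nothing follows

# Hypothetical URL with a query string, to show the split.
m2 = SPLITTER.search("http://www.example.com/search?q=regex&lang=en")
print(m2.groups())
```

The pattern does not require a scheme, so it can also match things such as bare file names with an extension; filtering the results afterwards may be necessary.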

Jerome