Writing a regex that is working pretty well

Question

I have written a regex query to extract a URL. Sample text:

<p>a tinyurl here <a href="https://vvvconf.instenv.atl-test.space/x/HQAU">https://vvvconf.instenv.atl-test.space/x/HQAU</a></p>

URL I need to extract:

https://vvvconf.instenv.atl-test.space/x/HQAU

My regex attempts:

https:\/\/vvvconf.[a-z].*\/x\/[a-zA-z0-9]*

This extracts:

{"https://vvvconf.instenv.atl-test.space/x/HQAU">https://vvvconf.instenv.atl-test.space/x/HQAU"}

>https:\/\/vvvconf.[a-z].*\/x\/[a-zA-z0-9]*<

This extracts:

{">https://vvvconf.instenv.atl-test.space/x/HQAU<"}

How can I refine the regex so I just extract the URL https://vvvconf.instenv.atl-test.space/x/HQAU?

Tip: Parse HTML with an HTML parser. Using a regular expression is not the way to crack this nut. — tadman, Dec 09 '20 at 00:11
For the example data you could make the dot non greedy `.*?` but try using a parser if available. See https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — The fourth bird, Dec 09 '20 at 00:15
@tadman I need to do this as a part of extracting data from a database column, so I am not using HTML parser. — Vik G, Dec 09 '20 at 00:16
@VikG, if the database column contains HTML, why _not_ use an HTML parser? — Charles Duffy, Dec 09 '20 at 00:31

score 1 · Accepted Answer · answered Dec 09 '20 at 00:44

It depends on whether you want to extract the URL from the href attribute or from the text inside the <a> tag. Assuming the latter, you can use a positive lookbehind (if your regex flavor supports it):

const input = '<p>a tinyurl here <a href="https://vvvconf.instenv.atl-test.space/x/HQAU">https://vvvconf.instenv.atl-test.space/x/HQAU</a></p>';
// Lookbehind assertions need an ES2018+ JavaScript engine.
const regex = /(?<=[>])https:[^<]*/;
console.log(input.match(regex));

Output:

[
  "https://vvvconf.instenv.atl-test.space/x/HQAU"
]

Explanation:

(?<=[>]) - positive lookbehind for > (requires a preceding > but excludes it from the match)
https: - matches the start of the URL (extend the literal prefix as far as desired)
[^<]* - matches every character up to, but not including, the next <
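
If you want the URL from the href attribute instead, or your regex flavor lacks lookbehind, a capture group is one possible alternative. The following is only a sketch under the assumption that the attribute is always written with double quotes, as in the sample; the helper name extractHrefUrl is hypothetical and not part of the original answer.

```javascript
// Hypothetical helper (name chosen for illustration only).
// Assumes the href value is wrapped in double quotes, as in the sample text.
function extractHrefUrl(html) {
  // Group 1 captures everything between href=" and the next double quote.
  const match = html.match(/href="(https:\/\/[^"]*)"/);
  return match ? match[1] : null;
}

const sample = '<p>a tinyurl here <a href="https://vvvconf.instenv.atl-test.space/x/HQAU">https://vvvconf.instenv.atl-test.space/x/HQAU</a></p>';
console.log(extractHrefUrl(sample)); // https://vvvconf.instenv.atl-test.space/x/HQAU
```

Because the negated class [^"]* stops at the closing quote, the match cannot run on into the link text the way the greedy .* in the question does.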
