0

I have written a regex query to extract a URL. Sample text:

<p>a tinyurl here <a href="https://vvvconf.instenv.atl-test.space/x/HQAU">https://vvvconf.instenv.atl-test.space/x/HQAU</a></p>

URL I need to extract:

https://vvvconf.instenv.atl-test.space/x/HQAU

My regex attempts:

  1. https:\/\/vvvconf.[a-z].*\/x\/[a-zA-z0-9]*
    

    This extracts:

    {"https://vvvconf.instenv.atl-test.space/x/HQAU">https://vvvconf.instenv.atl-test.space/x/HQAU"}
    
  2. >https:\/\/vvvconf.[a-z].*\/x\/[a-zA-z0-9]*<
    

    This extracts:

    {">https://vvvconf.instenv.atl-test.space/x/HQAU<"}
    

How can I refine the regex so I just extract the URL https://vvvconf.instenv.atl-test.space/x/HQAU?

John Kugelman
  • 349,597
  • 67
  • 533
  • 578
Vik G
  • 539
  • 3
  • 8
  • 22

1 Answers1

1

Depends if you want to extract the URL from the href attribute, or the text within the a tag. Assuming the latter you can use a positive lookbehind (if your regex flavor supports it)

const input = '<p>a tinyurl here <a href="https://vvvconf.instenv.atl-test.space/x/HQAU">https://vvvconf.instenv.atl-test.space/x/HQAU</a></p>';
const regex = /(?<=[>])https:[^<]*/;
console.log(input.match(regex))

Output:

[
  "https://vvvconf.instenv.atl-test.space/x/HQAU"
]

Explanation:

  • (?<=[>]) - positive lookbehind for > (expects but excludes >)
  • https: - expect start of URL (add as much as desired)
  • [^<]* - scan over everything that is not <
Peter Thoeny
  • 7,379
  • 1
  • 10
  • 20