I have an XML file with many lines like:

<xhtml:link vip="true" href="http://store.vcenter.com/stores/en/product/tigers-midi/100" />

How do I extract just the link - http://store.vcenter.com/stores/en/product/tigers-midi/100?

I tried http://www\.\.com[^<]+ but that captures everything until the end of the line, including the quotes and the closing XML tag.

I'm using this expression with egrep.

Test45

1 Answer

Don't parse XML/HTML with regex, use a proper XML/HTML parser.

Check: Using regular expressions with HTML tags

You can use one of the following:

  • xmllint
  • xmlstarlet
  • saxon-lint

File:

<root>
<xhtml:link vip="true" href="http://store.vcenter.com/stores/en/product/tigers-midi/100" />
</root>

Example with xmllint:

xmllint --xpath '//*[@vip="true"]/@href' file.xml 2>/dev/null

Output:

 href="http://store.vcenter.com/stores/en/product/tigers-midi/100"
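
If you only want the URL itself, without the href="..." wrapper, wrapping the XPath in the standard string() function should print just the attribute value (note: string() returns only the first match):

xmllint --xpath 'string(//*[@vip="true"]/@href)' file.xml 2>/dev/null

Output:

http://store.vcenter.com/stores/en/product/tigers-midi/100

With xmlstarlet, an equivalent query could look like this (a sketch; the sample file uses an undeclared xhtml namespace prefix, so a strict parser may complain unless the prefix is declared):

xmlstarlet sel -t -v '//*[@vip="true"]/@href' -n file.xml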

If you need a quick & dirty one-time command, you can do:

egrep -o 'https?://[^"]+' file
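
Against the sample file above, this should print:

http://store.vcenter.com/stores/en/product/tigers-midi/100

(egrep is equivalent to grep -E; the -o option prints only the matched part of each line.)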
Gilles Quénot
  • I'm getting `warning: failed to load external entity "/@href"` – Test45 Feb 16 '18 at 17:32
  • Added a grep solution in case it's a one-time usage. If you want to use a parser correctly and are having trouble, better to post the input file – Gilles Quénot Feb 16 '18 at 17:34
  • Thanks for your effort. I'll play with these options and see if I can make it work. The grep command somehow breaks my terminal emulator - enter doesn't execute the command, but opens a new line... – Test45 Feb 16 '18 at 17:39
  • Looks like your terminal is broken; it seems to have nothing to do with the grep. Try the `reset` command in the terminal – Gilles Quénot Feb 16 '18 at 17:42