I have an XML file with many lines like:

<xhtml:link vip="true" href="http://store.vcenter.com/stores/en/product/tigers-midi/100" />

How do I extract just the link - http://store.vcenter.com/stores/en/product/tigers-midi/100?

I tried http://www\.\.com[^<]+ but that captures everything until the end of the line, including the quotes and the closing XML tag.

I'm using this expression with egrep.

Test45

1 Answer

Don't parse XML/HTML with regex, use a proper XML/HTML parser.

Check: Using regular expressions with HTML tags

You can use one of the following:

  • xmllint
  • xmlstarlet
  • saxon-lint

File:

<root>
<xhtml:link vip="true" href="http://store.vcenter.com/stores/en/product/tigers-midi/100" />
</root>

Example with xmllint:

xmllint --xpath '//*[@vip="true"]/@href' file.xml 2>/dev/null

Output:

 href="http://store.vcenter.com/stores/en/product/tigers-midi/100"
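
If you only want the URL itself, without the href="..." wrapper, wrapping the XPath in the standard string() function should print just the attribute value (note: string() returns only the first match):

xmllint --xpath 'string(//*[@vip="true"]/@href)' file.xml 2>/dev/null

Output:

http://store.vcenter.com/stores/en/product/tigers-midi/100

With xmlstarlet, an equivalent query could look like this (a sketch; the sample file uses an undeclared xhtml namespace prefix, so a strict parser may complain unless the prefix is declared):

xmlstarlet sel -t -v '//*[@vip="true"]/@href' -n file.xml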

If you need a quick & dirty one-time command, you can do:

egrep -o 'https?://[^"]+' file
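
Against the sample file above, this should print:

http://store.vcenter.com/stores/en/product/tigers-midi/100

(egrep is equivalent to grep -E; the -o option prints only the matched part of each line.)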
Gilles Quénot
  • I'm getting `warning: failed to load external entity "/@href"` – Test45 Feb 16 '18 at 17:32
  • Added a grep solution in case it's a one-time usage. If you want to use a parser correctly and are having trouble, better to post the input file – Gilles Quénot Feb 16 '18 at 17:34
  • Thanks for your effort. I'll play with these options and see if I can make it work. The grep command somehow breaks my terminal emulator - enter doesn't execute the command, but opens a new line... – Test45 Feb 16 '18 at 17:39
  • Looks like your terminal is broken; it seems to have nothing to do with the grep. Try the `reset` command in the terminal – Gilles Quénot Feb 16 '18 at 17:42