I want to grep the URL out of a .asx file. The file would normally look like this.

<ASX VERSION="3.0">
<ENTRY>
<TITLE>Blah Blah</TITLE>
<AUTHOR>Someone</AUTHOR>
<COPYRIGHT>(C)2014 Someone Else</COPYRIGHT>
<REF HREF="mms://www.example.com/video/FilmName/FilmName.wmv"/>
</ENTRY>
</ASX>

I want to get the URL without the quotes and with the mms:// stripped off.

I came up with a regex that uses lookarounds that does this successfully:

((?<=\/\/).*?).(?=\")

but of course I can't use this with grep. What other approach would be flexible enough to capture whatever comes between the mms:// and the closing ", and that I could put into a grep -o command?

Joseph
  • Of course, [don't use regex to parse XML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – tripleee Jan 12 '14 at 17:08

4 Answers

but of course I can't use this with grep.

Why not? Modern versions of GNU grep support the -P switch, which enables PCRE (Perl-compatible) regular expressions.

Try this:

grep -oP '((?<=//).*?).(?=")' file
www.example.com/video/FilmName/FilmName.wmv
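
If you're not sure whether the local grep was built with PCRE support, a small wrapper can probe for -P and fall back to sed. This is only a sketch under a few assumptions: the function name extract_url and the temporary sample file are made up for illustration, and the lookbehind is a simplified variant of the pattern above rather than the exact one.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print whatever sits between "://" and the closing quote.
extract_url() {
    local file=$1
    # Probe: does this grep accept -P? GNU grep usually does; BSD grep does not.
    if printf 'x://y"\n' | grep -oP '(?<=://)[^"]+' >/dev/null 2>&1; then
        grep -oP '(?<=://)[^"]+' "$file"
    else
        # Portable fallback: POSIX basic regex with a capture group.
        sed -n 's/.*HREF="[a-z]*:\/\/\([^"]*\)".*/\1/p' "$file"
    fi
}

# Sample input, assumed to mirror the .asx file from the question.
sample=$(mktemp)
printf '%s\n' '<ASX VERSION="3.0">' '<ENTRY>' \
    '<REF HREF="mms://www.example.com/video/FilmName/FilmName.wmv"/>' \
    '</ENTRY>' '</ASX>' > "$sample"

result=$(extract_url "$sample")
echo "$result"
rm -f "$sample"
```

Either branch should print the host and path without the scheme or the quotes, so the caller doesn't need to care which grep is installed.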
Lightness Races in Orbit
anubhava

Like this:

awk -F '[:"]' '/REF HREF/ {print substr($3,3)}' file
www.example.com/video/FilmName/FilmName.wmv
Jotne

With Bash, you can use prefix/suffix pattern removal:

url='<REF HREF="mms://www.example.com/video/FilmName/FilmName.wmv"/>'
url=${url#<REF HREF=\"}
url=${url%\"/>}
echo "URL is '$url'"   # Prints URL is 'mms://www.example.com/video/FilmName/FilmName.wmv'

${VAR#pattern} strips from the start of $VAR the shortest prefix that matches the glob pattern; ${VAR##pattern} strips the longest such prefix. ${VAR%pattern} and ${VAR%%pattern} do the same for the suffix at the end of $VAR.

An easy way to remember is that # is to the left of % on the keyboard. David Korn taught me that.
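
To make the four operators concrete, here is a small sketch that applies each of them to the URL from the question. The variable names are just for illustration; the last line shows one possible way to drop the scheme as well, which the snippet above leaves in place.

```shell
#!/usr/bin/env bash
url='mms://www.example.com/video/FilmName/FilmName.wmv'

# '#' removes the shortest matching prefix, '##' the longest.
short_prefix=${url#*/}     # /www.example.com/video/FilmName/FilmName.wmv
long_prefix=${url##*/}     # FilmName.wmv

# '%' removes the shortest matching suffix, '%%' the longest.
short_suffix=${url%/*}     # mms://www.example.com/video/FilmName
long_suffix=${url%%/*}     # mms:

# Dropping the scheme: remove the shortest prefix ending in "://".
no_scheme=${url#*://}      # www.example.com/video/FilmName/FilmName.wmv

printf '%s\n' "$short_prefix" "$long_prefix" "$short_suffix" "$long_suffix" "$no_scheme"
```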

David W.

Solution for OS X users, where grep (as of OS X 10.9) doesn't support -P, so look-arounds are not an option:

egrep -o '"[a-z]+://[^"]+' file | cut -d '/' -f 3-
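
To see why the pipeline works, it may help to look at the two stages separately. The sketch below runs them on the REF line from the question; the variable names are illustrative, and grep -E is used as the equivalent spelling of egrep.

```shell
#!/usr/bin/env bash
line='<REF HREF="mms://www.example.com/video/FilmName/FilmName.wmv"/>'

# Stage 1: keep the opening quote, the scheme, and everything up to (not including) the closing quote.
stage1=$(printf '%s\n' "$line" | grep -Eo '"[a-z]+://[^"]+')

# Stage 2: split on "/"; fields 1 and 2 are '"mms:' and the empty string between the two slashes.
stage2=$(printf '%s\n' "$stage1" | cut -d '/' -f 3-)

echo "$stage1"   # "mms://www.example.com/video/FilmName/FilmName.wmv
echo "$stage2"   # www.example.com/video/FilmName/FilmName.wmv
```

Because cut simply discards the first two slash-separated fields, the same pipeline works for any lowercase scheme, not just mms.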
mklement0