I have some HTML that I would like to pull a URL from using grep. Is there an elegant way to do this? So far, I'm using wget to dump the HTML into a tmp.html file, and then running this:

awk '/<a href=/,/<\/a\>/' tmp.html | grep -v "sha1|md5" |grep -E "*.rpm?" | tail -1

Given a list of strings like the following, I'd like to pull out only the last .rpm URL in the list.

<td><a href="http://maven-whatever:8081/nexus/content/repositories/snapshots/com/whatever/whatever/adv-svcs/something/0.0.1-SNAPSHOT/something-0.0.1-20150227.161014-81-sles11_64.rpm">something-0.0.1-20150227.161014-81-sles11_64.rpm</a></td>
user3270760
  • Why do you want to use grep? You're already using awk and there's nothing grep can do that awk can't. Post a few lines of sample input and expected output so we can show you how to do it. – Ed Morton Feb 27 '15 at 19:14

2 Answers

Using GNU awk for the 3rd arg to match() and given this input file:

$ cat file
<td><a href="http://maven-whatever:8081/nexus/content/repositories/snapshots/com/whatever/whatever/adv-svcs/something/0.0.1-SNAPSHOT/something-0.0.1-20150227.161014-81-sles11_64.rpm">something-0.0.1-20150227.161014-81-sles11_64.rpm</a></td>

This might be what you want:

$ cat tst.awk
match($0,/<a href=.*>(.*\.rpm)<\/a>/,a) && !/sha1|md5/ {url=a[1]} END{print url}

$ gawk -f tst.awk file
something-0.0.1-20150227.161014-81-sles11_64.rpm

or this:

$ cat tst.awk
match($0,/<a href="([^"]+\.rpm)".*<\/a>/,a) && !/sha1|md5/ {url=a[1]} END{print url}

$ gawk -f tst.awk file
http://maven-whatever:8081/nexus/content/repositories/snapshots/com/whatever/whatever/adv-svcs/something/0.0.1-SNAPSHOT/something-0.0.1-20150227.161014-81-sles11_64.rpm

but without more sample input and the expected output it's a guess.
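
If GNU awk is not available, roughly the same result can be had from the two-argument match() together with RSTART and RLENGTH, which POSIX awk provides. This is only a sketch: the three input lines and the example.invalid URLs are made up for illustration, and it assumes each link sits on one line with a double-quoted href.

```shell
# Hypothetical sample input: two .rpm links and one checksum link.
last_url=$(printf '%s\n' \
  '<td><a href="http://example.invalid/repo/pkg-1.rpm">pkg-1.rpm</a></td>' \
  '<td><a href="http://example.invalid/repo/pkg-2.rpm">pkg-2.rpm</a></td>' \
  '<td><a href="http://example.invalid/repo/pkg-2.rpm.md5">pkg-2.rpm.md5</a></td>' |
awk '
  # Skip checksum links, then locate href="....rpm" on the line.
  !/sha1|md5/ && match($0, /href="[^"]+\.rpm"/) {
      # RSTART and RLENGTH delimit the match; drop the leading href="
      # (6 characters) and the trailing quote (1 character).
      url = substr($0, RSTART + 6, RLENGTH - 7)
  }
  END { print url }
')
printf '%s\n' "$last_url"
```

Because url is overwritten on every matching line, the END block prints only the last .rpm URL seen.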

Ed Morton

The -o option causes grep to print out only the matches, instead of the full line which matches. If there is more than one match in a line, all of them will be printed.
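
A small illustration of the difference, using a made-up one-line input with two links on it (the file names and the simplified pattern are assumptions for the demo, not taken from the question):

```shell
# Hypothetical one-line input containing two .rpm names.
sample='<a href="a-1.rpm">a</a> <a href="b-2.rpm">b</a>'

# Without -o, grep prints the whole matching line.
printf '%s\n' "$sample" | grep 'rpm'

# With -o, each match is printed on its own line.
matches=$(printf '%s\n' "$sample" | grep -o '[a-z0-9-]*\.rpm')
printf '%s\n' "$matches"
```

The second command prints a-1.rpm and b-2.rpm on separate lines, which is what makes -o convenient to combine with tail.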

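*.rpm? looks like a shell glob rather than a useful regular expression: in a regex, . matches any character and ? makes the preceding m optional, so it does not restrict the match to names ending in .rpm. If you want to make the match meaningful, you'll need to be quite precise; possibly something like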

grep -o '"[^"]*\.rpm"'

will give you more or less what you are looking for (but it will output the quotes as well, and will not deal with %-escapes in the URL).
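
As a rough sketch of how that could be turned into "the last .rpm URL", the quotes can be removed with tr and the final match kept with tail. The dot is escaped here so that it matches a literal "."; the temporary file stands in for tmp.html, and its contents and the example.invalid URLs are invented for the demo.

```shell
# Hypothetical stand-in for the downloaded page (tmp.html in the question).
html=$(mktemp)
cat > "$html" <<'EOF'
<td><a href="http://example.invalid/repo/something-0.0.1-80.rpm">something-0.0.1-80.rpm</a></td>
<td><a href="http://example.invalid/repo/something-0.0.1-81.rpm">something-0.0.1-81.rpm</a></td>
<td><a href="http://example.invalid/repo/something-0.0.1-81.rpm.sha1">something-0.0.1-81.rpm.sha1</a></td>
EOF

# -o prints only the quoted strings ending in .rpm, tr drops the quotes,
# and tail keeps the last one.
last_rpm=$(grep -o '"[^"]*\.rpm"' "$html" | tr -d '"' | tail -n 1)
printf '%s\n' "$last_rpm"
rm -f "$html"
```

Since the pattern requires the closing quote immediately after .rpm, the .rpm.sha1 link is skipped without a separate grep -v step. %-escapes in the URL are still left as they are.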

You could probably do better with awk, since you are using that anyway.

Parsing HTML with regular expressions will never be as robust or as easy as using a real HTML parser, as has been observed frequently here.

rici