
I want to perform the action named in the title from the Linux command line (a short bash script would also do). The command I tried is:

sed 's/href="([^"])"/$1/g' page.html > list.lst

but obviously it failed.

To be precise, here is my input:

<link rel="stylesheet" type="text/css" href="style/css/colors.css" />
<link rel="stylesheet" type="text/css" href="style/css/global.css" />
<link rel="stylesheet" type="text/css" href="style/css/icons.css" />

The output I want is a comma-separated or space-separated list of all matches in the input file:

style/css/colors.css,style/css/global.css,style/css/icons.css

I think I got the right expression: href="([^"]*)"

But I have no clue how to apply it. sed performs a search-and-replace, which is not exactly what I want: I only need to keep the matches and throw the rest away, not replace them.

BiAiB
  • I've created a bash function to do this using gawk, take a look at http://stackoverflow.com/a/14085682/162337 – opsb Dec 29 '12 at 20:33

1 Answer

grep href page.html | sed 's/^.*href="\([^"]*\)".*$/\1/' | xargs | sed 's/ /,/g'

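This selects every line that contains href and extracts a single href value from each one. Because the leading .* is greedy, that is the last href on the line when there are several. The values are then joined with commas. Also, refer to this post about parsing HTML with regular expressions.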
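If a line may hold more than one href attribute, one possible variation is to let grep print each match on its own line and strip the wrapper afterwards. This is a minimal sketch, not part of the original answer: the function name extract_hrefs is hypothetical, and it assumes a grep that supports -o (GNU and BSD grep do; POSIX does not require it) and double-quoted attribute values.

```shell
# Hypothetical helper: print every href="..." value found in the file
# given as $1, joined into a single comma-separated line.
extract_hrefs() {
    grep -o 'href="[^"]*"' "$1" |      # one match per output line, even with several per input line
        sed 's/^href="//; s/"$//' |    # drop the leading href=" and the closing quote
        paste -sd, -                   # join all lines with commas
}
```

It could be used as `extract_hrefs page.html > list.lst`. Replacing the comma in `paste -sd, -` with a space (`paste -sd' ' -`) would give a space-separated list instead.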

rid
  • This works great, thanks! As for the warning about parsing HTML with regular expressions, the input files won't contain anything other than these link elements, so it'll be OK, I guess. I'll just add a warning about probable devilish corruption when using the script. – BiAiB Jul 26 '11 at 15:09
  • @BiAiB, there are numerous things that can go wrong with parsing HTML with regex, such as using `'` instead of `"` for attributes (or not using quotes at all), using spaces between `href` and `=`, putting `href` on a new line, and many others. So if you're not absolutely sure that the HTML will look _exactly_ like that, it's probably a bad idea. – rid Jul 26 '11 at 15:12
  • Or simply a commented-out link node. By the way, I'm not sure single quotes are valid in XHTML. For now I'll use this because it's simple; when the time comes, it will be easy to replace. – BiAiB Jul 26 '11 at 15:33
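To illustrate two of the cases raised in the comments (single-quoted values and whitespace around the `=`), here is a slightly more tolerant sketch. It is still regex-based, so it remains an assumption-laden illustration rather than a robust parser: the function name extract_hrefs_loose is hypothetical, it relies on grep -o and -E, and it does not handle unquoted values, an href split across lines, or commented-out nodes.

```shell
# Hypothetical helper: like a plain href="..." extraction, but also accepts
# href='...' and optional whitespace on either side of the equals sign.
extract_hrefs_loose() {
    grep -oE "href[[:space:]]*=[[:space:]]*(\"[^\"]*\"|'[^']*')" "$1" |
        sed "s/^href[[:space:]]*=[[:space:]]*//; s/^[\"']//; s/[\"']\$//" |  # strip prefix and both quotes
        paste -sd, -                                                         # join with commas
}
```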