I'm trying to write a small bash script that:

  • wgets an HTML file every [x] minutes from the web
  • uses some Linux utility to find differences in the file between the last two updates
  • uses sed to modify the lines on which new text was detected

The problem I am running into is that the HTML file uses in-line CSS to format a table, but the actual code for the page is stored on one long line.

Effectively I need a Linux utility that can scan through a single line of code, find every instance of text between each pair of tags, and put each instance on its own line. That should make scanning the text easier. Every tool I've tried works on a per-line basis, which doesn't help here because the entire page is stored on a single line.

1 Answer

You could first split the content into lines, by substituting (say) > with >\n. That will break up the document on the end of each HTML tag.
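
A minimal sketch of that substitution, assuming a made-up one-line table (the `html` and `split` variable names and the sample markup are only for illustration). The replacement is written as a backslash followed by a real newline, which should work in sed implementations that don't understand `\n` in the replacement:

```shell
# Hypothetical sample: a whole table on one line, as in the question.
html='<table><tr><td>alpha</td><td>beta</td></tr></table>'

# Put a newline after every '>' so each tag ends its own line.
# The replacement is a backslash followed by a literal newline.
split=$(printf '%s\n' "$html" | sed 's/>/>\
/g')

printf '%s\n' "$split"
```

After this, the cell text sits at the start of lines such as `alpha</td>`, so line-oriented tools can work on it.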

Maybe you don't even need to do that: awk's RS variable lets you define the record separator as ">" instead of newline. See this page for an example of using RS: http://www.thegeekstuff.com/2010/01/8-powerful-awk-built-in-variables-fs-ofs-rs-ors-nr-nf-filename-fnr/
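
A rough sketch of the RS idea, again on a made-up one-line table (the sample markup and the `cells` variable are assumptions for illustration). With RS set to ">", each record is either a bare tag or some text followed by the start of a closing tag, so splitting fields on "<" leaves the text in the first field:

```shell
# Hypothetical sample: a whole table on one line.
html='<table><tr><td>alpha</td><td>beta</td></tr></table>'

# RS=">" ends a record at every '>'; FS="<" splits each record at '<'.
# Records that are bare tags have an empty first field, so only records
# whose first field contains a non-blank character are printed.
cells=$(printf '%s\n' "$html" |
    awk 'BEGIN { RS = ">"; FS = "<" } $1 ~ /[^ \t\n]/ { print $1 }')

printf '%s\n' "$cells"
```

This only holds up while the text itself never contains ">" or "<".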

John Zwinck
  • I'm looking at the RS variable now. As for your first example, should I use sed to modify each "" tag with "\n"? – user2057895 Feb 10 '13 at 01:00
  • some text If you set RS to ">" you'll get , some text, , Three records, from one line. However, if your text can contain ">", it'll pickle things a little. – Bill Woodger Feb 10 '13 at 01:09
  • Taking John's advice, I tried `sed -i 's/<\/tr>/<\/tr>\n/g' file.html`. This did the trick! Regular expressions are confusing. – user2057895 Feb 10 '13 at 01:20
  • Yes, for example you could use sed to add newlines after each closing tag you're interested in. Note that most versions of sed do not make this particularly easy, so see this other answer for how to do that: http://stackoverflow.com/questions/6111679/insert-linefeed-in-sed – John Zwinck Feb 10 '13 at 01:20
  • Regarding that sed expression: you can use characters other than slash to delimit sed commands (whatever character follows the `s` sets what sed expects for the rest of the delimiters, so you can use almost anything). So you may find this more readable: `s@</tr>@</tr>\n@g`. – John Zwinck Feb 10 '13 at 01:23
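
Putting the comments together, here is a hedged sketch of how the row split could feed a comparison between two downloads. The two snapshots, the `split_rows` helper and the temporary file names are all hypothetical. It uses `@` as the sed delimiter and the backslash-newline form of the replacement, so it should not depend on `\n` being supported:

```shell
# Hypothetical snapshots of the same one-line page; the second has one new row.
old='<table><tr><td>alpha</td></tr></table>'
new='<table><tr><td>alpha</td></tr><tr><td>beta</td></tr></table>'

# Break the line after each closing </tr>. With @ as the delimiter,
# the slash in </tr> needs no escaping.
split_rows() {
    sed 's@</tr>@</tr>\
@g'
}

tmpdir=$(mktemp -d)
printf '%s\n' "$old" | split_rows > "$tmpdir/old.html"
printf '%s\n' "$new" | split_rows > "$tmpdir/new.html"

# diff exits with status 1 when the files differ, hence the "|| true".
diff "$tmpdir/old.html" "$tmpdir/new.html" > "$tmpdir/changes" || true

# Lines present only in the new file are prefixed with "> ".
added=$(sed -n 's/^> //p' "$tmpdir/changes")
rm -rf "$tmpdir"

printf '%s\n' "$added"
```

In a real script the two snapshots would be the files saved by successive wget runs rather than shell variables.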