I just learned how to use bash to extract data from HTML like this:

<td>hello</td> <td>whatsup</td>

I can use awk -F '[<>]' '/<td>/,/<\/td>/ {print $3}' test.html

However, how do I go about it if the content is separated by newlines, like this?

<td>
hello
</td>
<td>
whatsup
</td>

Going through tutorials, the best code I could come up with is this, which doesn't seem to work:

awk -F '\n' '/<td>/,/<\/td>/ {print $2}' test.html

John Smith

1 Answer

You learned wrong :-). Never use range expressions (/start/,/end/) as they make trivial jobs slightly briefer but then need a complete rewrite or duplicated conditions for anything even remotely interesting. Always use a flag instead (/start/{f=1} f; /end/{f=0}).
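As a rough illustration of the flag idiom, here is a minimal sketch. It assumes each cell's text sits on its own line between the `<td>` and `</td>` lines; the sample file contents are made up for the demo. The tests are ordered so that the flag is cleared before the print and set after it, which leaves the tag lines themselves out of the output. That ordering is a variation on the idiom above, not the only way to write it.

```shell
# Hypothetical sample input: one tag or one value per line (assumed layout).
printf '%s\n' '<td>' 'hello' '</td>' '<td>' 'whatsup' '</td>' > test.html

# Clear the flag on the closing tag, print while the flag is set,
# then set the flag on the opening tag. Because the flag is set only
# after the print test, the <td> and </td> lines are never printed.
awk '/<\/td>/{f=0} f; /<td>/{f=1}' test.html
# prints:
# hello
# whatsup
```

Changing what gets printed (for example, including the tag lines) only means reordering these three pieces, which is where a flag is more flexible than a range expression.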

In this case, though, none of that is relevant because the right way to do what you want is with an XML parser. If you can't do that for some reason, then you'd do this with GNU awk, which supports a multi-char RS:

awk -v RS='\\s*</td>' 'sub(/.*<td>\s*/,"")' file
hello
whatsup
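If GNU awk is not available, something along these lines might work in any POSIX awk. This is only a sketch under assumptions: it reads the whole file into memory, it assumes cells contain plain text with no nested tags or attributes, and the variable names `buf` and `cell` are made up for illustration.

```shell
# Hypothetical sample input, same assumed layout as in the question.
printf '%s\n' '<td>' 'hello' '</td>' '<td>' 'whatsup' '</td>' > test.html

awk '
# Collect the whole file into one string, keeping the newlines.
{ buf = buf $0 "\n" }
END {
  # Repeatedly find the next <td>...</td> pair that contains no other tag.
  while (match(buf, /<td>[^<]*<\/td>/)) {
    # Drop the 4-character opening tag and the 5-character closing tag.
    cell = substr(buf, RSTART + 4, RLENGTH - 9)
    # Trim leading and trailing blanks, tabs, and newlines.
    gsub(/^[ \t\n]+|[ \t\n]+$/, "", cell)
    print cell
    # Continue searching after the current match.
    buf = substr(buf, RSTART + RLENGTH)
  }
}' test.html
# prints:
# hello
# whatsup
```

This is still pattern matching rather than parsing, so it shares the usual fragility of regex-based HTML extraction.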
Ed Morton