Regex for awk in multiline html script

Question

I just learned how to use bash to extract data from HTML like this:

<td>hello</td>
<td>whatsup</td>

I can use awk -F '[<>]' '/<td>/,/<\/td>/ {print $3}' test.html

However, how do I go about it if the content is separated by newlines, like this?

<td>
hello
</td>
<td>
whatsup
</td>

Going through tutorials, the best code I could come up with is this, which doesn't seem to work:

awk -F '\n' '/<td>/,/<\/td>/ {print $2}' test.html

I suggest to use an XML/HTML parser (xmllint, xmlstarlet ...). — Cyrus, Oct 23 '16 at 08:01
I'm trying to learn how to parse html myself without using any parsers. — John Smith, Oct 23 '16 at 08:04
See: [Don't use regex to parse HTML](http://stackoverflow.com/a/1732454/4060711) — Cyrus, Oct 23 '16 at 08:07
Okay, I will use a parser. I just thought this wasn't such a complex script that regex wouldn't be able to take it on. — John Smith, Oct 23 '16 at 08:14

score 1 · Answer 1 · answered Oct 23 '16 at 14:32

You learned wrong :-). Never use range expressions (/start/,/end/) as they make trivial jobs slightly briefer but then need a complete rewrite or duplicated conditions for anything even remotely interesting. Always use a flag instead (/start/{f=1} f; /end/{f=0}).
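To make that concrete, here is a minimal sketch of the flag idiom. It assumes the multi-line layout from the question, with one tag or value per line; the variable names `input`, `inclusive`, and `exclusive` are only for illustration.

```shell
# Sample input, assumed to match the multi-line layout from the question.
input='<td>
hello
</td>
<td>
whatsup
</td>'

# Flag equivalent of the range /<td>/,/<\/td>/ : both delimiter lines are printed too.
inclusive=$(printf '%s\n' "$input" | awk '/<td>/{f=1} f; /<\/td>/{f=0}')

# Same three pieces, reordered: clear the flag first, set it last,
# so neither delimiter line is printed.
exclusive=$(printf '%s\n' "$input" | awk '/<\/td>/{f=0} f; /<td>/{f=1}')

printf '%s\n' "$exclusive"
```

With the flag, dropping the delimiter lines is just a matter of reordering the statements. Getting the same result from the range expression would mean testing for the delimiters again inside the action.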

In this case, though, none of that is relevant, because the right way to do what you want is with an XML parser. If you can't do that for some reason, then you'd do this with GNU awk, which is needed for the multi-char RS and the \s shorthand:

awk -v RS='\\s*</td>' 'sub(/.*<td>\s*/,"")' file
hello
whatsup
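If GNU awk isn't available, something along these lines might work with any POSIX awk. It is only a sketch, not a general HTML parser: it assumes plain `<td>` tags with no attributes and no nested tables, and the function name `extract_td` is made up for the example.

```shell
# Hypothetical helper: print the trimmed text of every <td>...</td> cell.
# Reads the named files, or stdin when no file is given.
extract_td() {
    awk '
    { buf = buf $0 " " }                      # join all lines; each newline becomes a space
    END {
        while (match(buf, /<td>/)) {
            buf = substr(buf, RSTART + RLENGTH)   # drop everything up to and including <td>
            stop = index(buf, "</td>")            # position of the closing tag
            if (stop == 0) break                  # unterminated cell: give up
            cell = substr(buf, 1, stop - 1)
            gsub(/^[ \t]+|[ \t]+$/, "", cell)     # trim leading/trailing blanks
            print cell
            buf = substr(buf, stop + 5)           # continue after </td> (5 characters)
        }
    }' "$@"
}

extract_td <<'EOF'
<td>
hello
</td>
<td>
whatsup
</td>
EOF
```

Because the whole input is joined into one string before matching, the same function should handle both the one-cell-per-line layout and cells that span several lines.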
