awk - output

Question

I have the following html

<table id="statTableHTML" class="table hidden"><thead><tr><th>Characteristic</th><th>GCI Score</th><th>Legal</th><th>Technical</th><th>Organizational</th><th>Capacity Building</th><th>Cooperation</th></tr></thead><tbody><tr><td>United Kingdom</td><td>0.93</td><td>0.2</td><td>0.19</td><td>0.2</td><td>0.19</td><td>0.15</td></tr><tr><td>United States</td><td>0.93</td><td>0.2</td><td>0.18</td><td>0.2</td><td>0.19</td><td>0.15</td></tr><tr><td>France</td><td>0.92</td><td>0.2</td><td>0.19</td><td>0.2</td><td>0.19</td><td>0.14</td></tr><tr><td>Lithuania</td><td>0.91</td><td>0.2</td><td>0.17</td><td>0.2</td><td>0.19</td><td>0.16</td></tr><tr><td>Estonia</td><td>0.91</td><td>0.2</td><td>0.2</td><td>0.19</td><td>0.17</td><td>0.15</td></tr><tr><td>Singapore</td><td>0.9</td><td>0.2</td><td>0.19</td><td>0.19</td><td>0.2</td><td>0.13</td></tr><tr><td>Spain</td><td>0.9</td><td>0.2</td><td>0.18</td><td>0.2</td><td>0.17</td><td>0.15</td></tr><tr><td>Malaysia</td><td>0.89</td><td>0.18</td><td>0.2</td><td>0.2</td><td>0.2</td><td>0.12</td></tr><tr><td>Norway</td><td>0.89</td><td>0.19</td><td>0.2</td><td>0.18</td><td>0.19</td><td>0.14</td></tr><tr><td>Canada</td><td>0.89</td><td>0.2</td><td>0.19</td><td>0.2</td><td>0.17</td><td>0.14</td></tr><tr><td>Australia</td><td>0.89</td><td>0.2</td><td>0.17</td><td>0.2</td><td>0.18</td><td>0.14</td></tr></tbody></table>

I would like to get all the values within the tr, td elements.

I have the following awk -F '</*td>|</*tr>' '{print $5, $7, $9, $11, $13, $15, $17}'

however it only outputs the first row United Kingdom 0.93 0.2 0.19 0.2 0.19 0.15

how can i write this so I get all the values of the rows?

thanks

[Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) — Cyrus, May 31 '21 at 04:10

Ed Morton · Answer 1 · 2021-05-31T01:37:02.320

With GNU awk for multi-char RS:

$ awk -v RS='<tr><td>|</td></tr>' -F'(</?td>)+' -v OFS='\t' 'NF>1{$1=$1; print}' file
United Kingdom  0.93    0.2     0.19    0.2     0.19    0.15
United States   0.93    0.2     0.18    0.2     0.19    0.15
France  0.92    0.2     0.19    0.2     0.19    0.14
Lithuania       0.91    0.2     0.17    0.2     0.19    0.16
Estonia 0.91    0.2     0.2     0.19    0.17    0.15
Singapore       0.9     0.2     0.19    0.19    0.2     0.13
Spain   0.9     0.2     0.18    0.2     0.17    0.15
Malaysia        0.89    0.18    0.2     0.2     0.2     0.12
Norway  0.89    0.19    0.2     0.18    0.19    0.14
Canada  0.89    0.2     0.19    0.2     0.17    0.14
Australia       0.89    0.2     0.17    0.2     0.18    0.14

or if you prefer tabular instead of tab-separated output just pipe it to column:

$ awk -v RS='<tr><td>|</td></tr>' -F'(</?td>)+' -v OFS='\t' 'NF>1{$1=$1; print}' file | column -s$'\t' -t
United Kingdom  0.93  0.2   0.19  0.2   0.19  0.15
United States   0.93  0.2   0.18  0.2   0.19  0.15
France          0.92  0.2   0.19  0.2   0.19  0.14
Lithuania       0.91  0.2   0.17  0.2   0.19  0.16
Estonia         0.91  0.2   0.2   0.19  0.17  0.15
Singapore       0.9   0.2   0.19  0.19  0.2   0.13
Spain           0.9   0.2   0.18  0.2   0.17  0.15
Malaysia        0.89  0.18  0.2   0.2   0.2   0.12
Norway          0.89  0.19  0.2   0.18  0.19  0.14
Canada          0.89  0.2   0.19  0.2   0.17  0.14
Australia       0.89  0.2   0.17  0.2   0.18  0.14

jared_mamrot · Accepted Answer · 2021-05-31T01:01:36.227

Something like:

awk -F '</*td>|</*tr>' '{ for (i=5;i<=NF-1;i+=2) print $i }' file

should get you the values, but they will be output in a single column so you will need to format the column further with e.g.

awk -F '</*td>|</*tr>' '{ for (i=5;i<=NF-1;i+=2) print $i }' test | awk -v RS= '{gsub(/\n/, "\t")}1'
United Kingdom  0.93    0.2 0.19    0.2 0.19    0.15
United States   0.93    0.2 0.18    0.2 0.19    0.15
France  0.92    0.2 0.19    0.2 0.19    0.14
Lithuania   0.91    0.2 0.17    0.2 0.19    0.16
Estonia 0.91    0.2 0.2 0.19    0.17    0.15
Singapore   0.9 0.2 0.19    0.19    0.2 0.13
Spain   0.9 0.2 0.18    0.2 0.17    0.15
Malaysia    0.89    0.18    0.2 0.2 0.2 0.12
Norway  0.89    0.19    0.2 0.18    0.19    0.14
Canada  0.89    0.2 0.19    0.2 0.17    0.14
Australia   0.89    0.2 0.17    0.2 0.18    0.14

score 0 · Answer 3 · edited May 31 '21 at 02:06

Check how your regex is matching each field. Each line of the output starts with the number of the field you came up with. Empty lines mean <tr> and <td> were side by side.

You could try this, instead:

BEGIN{RS="<\/?t[dr]><\/?t[dr]>";ORS=" "}
$0 && !/^</
END{printf "\n"}

Try it online!

It removes all trs and tds, regarding them as record separators. The output record separator becomes a space. Then, only prints non-blank records that do not start with <, i.e., that do not start with a html tag. Also prints a trailing newline, for the sake of POSIX definition of a line.

score 0 · Answer 4 · answered May 31 '21 at 03:37

Based on your attempt, seems like number of columns per row is known to be 7. So, you can use tools like xpath or xmllint and then use pr to format the output.

$ # same as: xmllint --xpath '//tr/td/text()'
$ xpath -q -e '//tr/td/text()' ip.html | pr -7ats' '
United Kingdom 0.93 0.2 0.19 0.2 0.19 0.15
United States 0.93 0.2 0.18 0.2 0.19 0.15
France 0.92 0.2 0.19 0.2 0.19 0.14
Lithuania 0.91 0.2 0.17 0.2 0.19 0.16
Estonia 0.91 0.2 0.2 0.19 0.17 0.15
Singapore 0.9 0.2 0.19 0.19 0.2 0.13
Spain 0.9 0.2 0.18 0.2 0.17 0.15
Malaysia 0.89 0.18 0.2 0.2 0.2 0.12
Norway 0.89 0.19 0.2 0.18 0.19 0.14
Canada 0.89 0.2 0.19 0.2 0.17 0.14
Australia 0.89 0.2 0.17 0.2 0.18 0.14

awk - output

4 Answers4