Parsing HTML table in Bash using sed

Question

In bash I am trying to parse the following file:

Input:

</a></td></tr><tr><td>stuff.txt (15.18 KB)</td><td>12/01/2015</td><td>Large things</td><td>158520312</td><td><a class="btn-down" download href="https://resource.com/stones">
</a></td></tr><tr><td>flowers.pdf (83.03 MB)</td><td>23/03/2011</td><td>Large flowers</td><td>872448000</td><td><a class="btn-down" download href="https://resource.com/flosers with stuff">
</a></td></tr><tr><td>apples.pdf (281.16 MB)</td><td>21/04/2012</td><td>Large things like apples</td><td>299009564</td><td><a class="btn-down" download href="https://resource.com/apples">
</a></td></tr><tr><td>stones.pdf (634.99 MB)</td><td>11/07/2011</td><td>Large stones from mountains</td><td>67100270</td><td><a class="btn-down" download href="https://stuff.com/findstones">

Wanted output:

12/01/2015 158520312 "https://resource.com/stones"
23/03/2011 872448000 "https://resource.com/flosers with stuff"
21/04/2012 299009564 "https://resource.com/apples~withstuff"
11/07/2011 67100270 "https://stuff.com/findstones"

I got to the point that I have:

# less input.txt | sed -e "s/><tr><td//" -e "s/\///" -e "s/a>//" -e "s/<\/td><\/tr>//g" -e "s/<\/td><td>//g" -e "s/>$//g" -e "s/<a class=\"btn-down\" download href=//g"

<stuff.txt (15.18 KB)12/01/2015Large things158520312"https://resource.com/stones"
<flowers.pdf (83.03 MB)23/03/2011Large flowers872448000"https://resource.com/flosers with stuff"
<apples.pdf (281.16 MB)21/04/2012Large things like apples299009564"https://resource.com/apples"
<stones.pdf (634.99 MB)11/07/2011Large stones from mountains67100270"https://stuff.com/findstones"

Is there an easier way to parse it? I feel it can be done much more simply, and I am not even halfway through the parsing.

might be better suited to use html/xml parsers instead of regex — Sundeep, Jun 22 '18 at 15:43
Add your xml/html file to your question and your desired output. — Cyrus, Jun 22 '18 at 18:40
[Don't Parse HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) — Cyrus, Jun 22 '18 at 18:42
@Jasen: I suggest [xmlstarlet](https://stackoverflow.com/tags/xmlstarlet/info). — Cyrus, Jun 23 '18 at 09:20

score 1 · Answer 1 · answered Jun 22 '18 at 15:24

Could you please try the following and let us know if it helps.

awk -F"[><]" '{sub(/.*=/,"",$28);print $15,$23,$28}'  Input_file
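To see why fields 15, 23 and 28 are the ones printed, it can help to list every field awk produces for one row. This is only a sketch, assuming the first input line from the question is representative; the variable `row` and the function `show_fields` are names made up here for illustration.

```shell
# One sample row from the question (assumed representative of the whole file).
row='</a></td></tr><tr><td>stuff.txt (15.18 KB)</td><td>12/01/2015</td><td>Large things</td><td>158520312</td><td><a class="btn-down" download href="https://resource.com/stones">'

# Hypothetical helper: split on '<' or '>' and print each non-empty field
# together with its index, so the field numbers can be read off directly.
show_fields() {
    printf '%s\n' "$1" |
        awk -F'[><]' '{ for (i = 1; i <= NF; i++) if ($i != "") print i ": " $i }'
}

show_fields "$row"
```

In the listing, the date shows up as field 15, the number as field 23, and the whole `a class=... href="..."` attribute string as field 28, which is why the `sub()` strips everything up to the last `=` from `$28`. If the real file has a different number of cells per row, the indices would shift accordingly.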

RavinderSingh13

@creed, try to select answers as correct answers, also try to up-vote people for their efforts too who are helping you. – RavinderSingh13 Jun 22 '18 at 15:32

score 1 · Answer 2 · answered Jun 22 '18 at 16:36

I'm sure the best way to solve your problem is to use an HTML parser. A solution for the sample shown:

sed -r 's/.*(..\/..\/....).*>([0-9]*)<\/.*href=([^>]*)>/\1 \2 \3/I' input.txt
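For anyone not on GNU sed: `-r` is the GNU spelling of `-E`, and the trailing `I` (case-insensitive) flag is a GNU extension that the sample shown does not seem to need, since `href` is lower-case throughout. Below is a sketch of the same substitution written with `-E` and without `I`, with the three capture groups commented. The function name `extract` and the variable `row` are made up for illustration.

```shell
# One sample row from the question (assumed representative of the whole file).
row='</a></td></tr><tr><td>flowers.pdf (83.03 MB)</td><td>23/03/2011</td><td>Large flowers</td><td>872448000</td><td><a class="btn-down" download href="https://resource.com/flosers with stuff">'

# Hypothetical wrapper around the one-line substitution, reading rows on stdin.
extract() {
    # .*(../../....)    skip ahead, then capture the dd/mm/yyyy date
    # .*>([0-9]*)</     capture the all-digit cell that sits right before a closing tag
    # .*href=([^>]*)>   capture everything after href= up to the closing '>'
    sed -E 's/.*(..\/..\/....).*>([0-9]*)<\/.*href=([^>]*)>/\1 \2 \3/'
}

printf '%s\n' "$row" | extract
```

Because the third group is `[^>]*` rather than a non-space pattern, a URL containing spaces, like the one in this row, is kept whole.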

Thanks for the efforts. The solution does not need to be fast, it just needs to work. – creed Jun 23 '18 at 15:58

Paul Hodges · Accepted Answer · 2018-06-25T13:28:28.883

Personally, I'd use perl, but that's not what you asked, so...

A pedantic stepwise approach, so that you can edit bits of the logic when needed.

Assuming the input is a file named x:

</a></td></tr><tr><td>stuff.txt (15.18 KB)</td><td>12/01/2015</td><td>Large things</td><td>158520312</td><td><a class="btn-down" download href="https://resource.com/stones">
</a></td></tr><tr><td>flowers.pdf (83.03 MB)</td><td>23/03/2011</td><td>Large flowers</td><td>872448000</td><td><a class="btn-down" download href="https://resource.com/flosers with stuff">
</a></td></tr><tr><td>apples.pdf (281.16 MB)</td><td>21/04/2012</td><td>Large things like apples</td><td>299009564</td><td><a class="btn-down" download href="https://resource.com/apples">
</a></td></tr><tr><td>stones.pdf (634.99 MB)</td><td>11/07/2011</td><td>Large stones from mountains</td><td>67100270</td><td><a class="btn-down" download href="https://stuff.com/findstones">

Try this:

sed -E '
 s/>$//;
 s/href=/>/;
 s/(<[^>]+>)+/~/g;
 s/~[^~]+~//;
 s/~[^~]+~/ /;
 s/~/ /;
' x

Output:

12/01/2015 158520312 "https://resource.com/stones"
23/03/2011 872448000 "https://resource.com/flosers with stuff"
21/04/2012 299009564 "https://resource.com/apples"
11/07/2011 67100270 "https://stuff.com/findstones"

Explained:

sed -E '

This uses extended regexes, and opens a script of sed code so that I can list each pattern individually. Each will be executed in order on each line, so it's not super efficient, but it's "readable" as regex code goes, and reasonably maintainable once you understand it, and so easy to edit when something needs tweaking.

s/>$//;

Strip the closing > off the end, to preserve the URL before squashing out all the other tags.

s/href=/>/;

Use the href= as a hook to put a > back in its place. That closes the <a ...> tag just before the URL, so we can squash out all the tags in one pass.

s/(<[^>]+>)+/~/g;

Convert EACH run of consecutive tags, and everything still inside them, to a single ~ delimiter.

s/~[^~]+~//;

Eliminate the first and second delimiters and the unneeded first field (the filename and size) between them.

s/~[^~]+~/ /;

Eliminate the third and fourth delimiters and the unneeded third field between them, replacing them with the space you wanted in the output.

Those two are very similar, and could certainly be combined with minimal shenanigans, but I left them nigh-redundant for easier explication.
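One possible way to fold those two substitutions into one is sketched below: anchor at the start of the line, keep the date in a capture group, and drop the fields on either side of it. This is a sketch only, assuming every row has the same five cells as the sample. The function name `parse_rows` and the variable `rows` are made up for illustration.

```shell
# Two sample rows from the question (assumed representative of the whole file).
rows='</a></td></tr><tr><td>stuff.txt (15.18 KB)</td><td>12/01/2015</td><td>Large things</td><td>158520312</td><td><a class="btn-down" download href="https://resource.com/stones">
</a></td></tr><tr><td>flowers.pdf (83.03 MB)</td><td>23/03/2011</td><td>Large flowers</td><td>872448000</td><td><a class="btn-down" download href="https://resource.com/flosers with stuff">'

# Hypothetical wrapper: same first three steps as the script above, then one
# substitution that removes "~field1~", keeps the date, and turns "~field3~"
# into a space; the last step converts the remaining delimiter before the URL.
parse_rows() {
    sed -E '
     s/>$//;
     s/href=/>/;
     s/(<[^>]+>)+/~/g;
     s/^~[^~]+~([^~]+)~[^~]+~/\1 /;
     s/~/ /;
    '
}

printf '%s\n' "$rows" | parse_rows
```

It saves one line at the cost of a denser pattern, so whether it is an improvement is a matter of taste.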

s/~/ /;

Convert the remaining delimiter to the other space you wanted between the remaining fields.

' x

Close the script and give it the filename to read.
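If you want to watch the line change step by step while tweaking, one option is to suppress the automatic printing with `-n` and add a `p` after each substitution. A minimal sketch, assuming a single sample row; the function name `trace_steps` and the variable `row` are made up for illustration.

```shell
# One sample row from the question (assumed representative of the whole file).
row='</a></td></tr><tr><td>stuff.txt (15.18 KB)</td><td>12/01/2015</td><td>Large things</td><td>158520312</td><td><a class="btn-down" download href="https://resource.com/stones">'

# Hypothetical debugging helper: -n turns off the default output, and each
# explicit p prints the pattern space as it stands after that substitution.
trace_steps() {
    sed -n -E '
     s/>$//;            p
     s/href=/>/;        p
     s/(<[^>]+>)+/~/g;  p
     s/~[^~]+~//;       p
     s/~[^~]+~/ /;      p
     s/~/ /;            p
    '
}

printf '%s\n' "$row" | trace_steps
```

This prints six lines per input row, one after each step. The third shows the fully delimited form (`~stuff.txt (15.18 KB)~12/01/2015~Large things~158520312~"https://resource.com/stones"`), and the sixth is the final output.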

Obviously, this leaves a LOT of room for improvement, and is in many ways stylistically repulsive, but hopefully it is a simple explanation of tricks you can hack into a maintainably useful solution to your issue.

Good luck.
