In bash I am trying to parse the following file:

Input:

</a></td></tr><tr><td>stuff.txt (15.18 KB)</td><td>12/01/2015</td><td>Large things</td><td>158520312</td><td><a class="btn-down" download href="https://resource.com/stones">
</a></td></tr><tr><td>flowers.pdf (83.03 MB)</td><td>23/03/2011</td><td>Large flowers</td><td>872448000</td><td><a class="btn-down" download href="https://resource.com/flosers with stuff">
</a></td></tr><tr><td>apples.pdf (281.16 MB)</td><td>21/04/2012</td><td>Large things like apples</td><td>299009564</td><td><a class="btn-down" download href="https://resource.com/apples">
</a></td></tr><tr><td>stones.pdf (634.99 MB)</td><td>11/07/2011</td><td>Large stones from mountains</td><td>67100270</td><td><a class="btn-down" download href="https://stuff.com/findstones">

Wanted output:

12/01/2015 158520312 "https://resource.com/stones"
23/03/2011 872448000 "https://resource.com/flosers with stuff"
21/04/2012 299009564 "https://resource.com/apples"
11/07/2011 67100270 "https://stuff.com/findstones"

I got to the point that I have:

# less input.txt | sed -e "s/><tr><td//" -e "s/\///" -e "s/a>//" -e "s/<\/td><\/tr>//g" -e "s/<\/td><td>//g" -e "s/>$//g" -e "s/<a class=\"btn-down\" download href=//g"

<stuff.txt (15.18 KB)12/01/2015Large things158520312"https://resource.com/stones"
<flowers.pdf (83.03 MB)23/03/2011Large flowers872448000"https://resource.com/flosers with stuff"
<apples.pdf (281.16 MB)21/04/2012Large things like apples299009564"https://resource.com/apples"
<stones.pdf (634.99 MB)11/07/2011Large stones from mountains67100270"https://stuff.com/findstones"

Is there an easier way to parse it? I feel it can be done much more simply, and I am not even halfway through the parsing.

jww
creed

3 Answers

Could you please try the following and let us know if it helps:

awk -F"[><]" '{sub(/.*=/,"",$28);print $15,$23,$28}'  Input_file
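
To see where the field numbers 15, 23 and 28 come from, it can help to print every field that the [><] separator produces for one line. This is only a minimal sketch: the sample line is copied from the question, and the variable names line, fields and result are made up for illustration.

```shell
# One sample row copied from the question's input.
line='</a></td></tr><tr><td>stuff.txt (15.18 KB)</td><td>12/01/2015</td><td>Large things</td><td>158520312</td><td><a class="btn-down" download href="https://resource.com/stones">'

# Print each non-empty field with its index; every '<' and '>' acts as a separator.
fields=$(printf '%s\n' "$line" |
  awk -F'[><]' '{ for (i = 1; i <= NF; i++) if ($i != "") print i ": " $i }')
printf '%s\n' "$fields"

# The same extraction as the one-liner above, applied to the sample line.
result=$(printf '%s\n' "$line" |
  awk -F'[><]' '{ sub(/.*=/, "", $28); print $15, $23, $28 }')
printf '%s\n' "$result"
```

The field numbers depend on the exact tag layout: one extra or missing tag shifts every index after it, so they should be re-checked if the markup changes. Because sub(/.*=/, ...) is greedy, it also assumes the URL itself contains no = character.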
RavinderSingh13

I'm sure the best way to solve your problem is to use an HTML parser. That said, here is a solution for the sample shown:

sed -r 's/.*(..\/..\/....).*>([0-9]*)<\/.*href=([^>]*)>/\1 \2 \3/I' input.txt
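
For readers who want to see what each capture group picks up, here is a stricter variant of the same idea, offered as a sketch rather than a replacement. It assumes the date is always dd/mm/yyyy in digits and the href value is always double-quoted; the variable names line and parsed are hypothetical.

```shell
# One sample row copied from the question's input (the URL contains spaces).
line='</a></td></tr><tr><td>flowers.pdf (83.03 MB)</td><td>23/03/2011</td><td>Large flowers</td><td>872448000</td><td><a class="btn-down" download href="https://resource.com/flosers with stuff">'

# \1 = date (dd/mm/yyyy), \2 = the all-digit cell, \3 = the quoted href value.
# -n plus the trailing p prints only lines where the pattern matched.
# '#' is used as the s delimiter so the slashes need no escaping.
parsed=$(printf '%s\n' "$line" |
  sed -nE 's#.*([0-9]{2}/[0-9]{2}/[0-9]{4}).*>([0-9]+)</.*href=("[^"]*")>.*#\1 \2 \3#p')
printf '%s\n' "$parsed"
```

Compared with the one-liner above, this version drops the case-insensitive I flag and silently skips lines that do not match instead of passing them through unchanged.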

Personally, I'd use perl, but that's not what you asked, so...

A pedantic stepwise approach, so that you can edit bits of the logic when needed.

Assuming the input is a file named x:

</a></td></tr><tr><td>stuff.txt (15.18 KB)</td><td>12/01/2015</td><td>Large things</td><td>158520312</td><td><a class="btn-down" download href="https://resource.com/stones">
</a></td></tr><tr><td>flowers.pdf (83.03 MB)</td><td>23/03/2011</td><td>Large flowers</td><td>872448000</td><td><a class="btn-down" download href="https://resource.com/flosers with stuff">
</a></td></tr><tr><td>apples.pdf (281.16 MB)</td><td>21/04/2012</td><td>Large things like apples</td><td>299009564</td><td><a class="btn-down" download href="https://resource.com/apples">
</a></td></tr><tr><td>stones.pdf (634.99 MB)</td><td>11/07/2011</td><td>Large stones from mountains</td><td>67100270</td><td><a class="btn-down" download href="https://stuff.com/findstones">

Try this:

sed -E '
 s/>$//;
 s/href=/>/;
 s/(<[^>]+>)+/~/g;
 s/~[^~]+~//;
 s/~[^~]+~/ /;
 s/~/ /;
' x

Output:

12/01/2015 158520312 "https://resource.com/stones"
23/03/2011 872448000 "https://resource.com/flosers with stuff"
21/04/2012 299009564 "https://resource.com/apples"
11/07/2011 67100270 "https://stuff.com/findstones"

Explained:

sed -E '

This uses extended regexes and opens a quoted sed script so that each substitution can be listed on its own line. Every substitution runs in order on every input line, so it's not super efficient, but it's "readable" as regex code goes, reasonably maintainable once you understand it, and easy to edit when something needs tweaking.

s/>$//;

Strip the closing > off the end, to preserve the URL before squashing out all the other tags.

s/href=/>/;

Use the href= as a hook to insert the > back, so that the <a ...> tag is closed before the URL and we can squash out all the tags in one pass.

s/(<[^>]+>)+/~/g;

Convert every run of consecutive tags, including everything still inside them, to a single ~ delimiter.

s/~[^~]+~//;

Eliminate the first and second delimiters and the unneeded first field (the file name and size) between them.

s/~[^~]+~/ /;

Eliminate the third and fourth delimiters and the unneeded third field between them, replacing them with the space you wanted in the output.

Those two are very similar, and could certainly be combined with minimal shenanigans, but I left them nigh-redundant for easier explication.

s/~/ /;

Convert the remaining delimiter to the other space you wanted between the remaining fields.

' x

Close the script and give it the filename to read.
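
To watch the line change as each command is applied, the steps can be run one at a time on a single sample row. This is a small illustrative sketch; the variable names line, step and cmd are made up, and the sample row is copied from the input above.

```shell
# One sample row copied from the input file.
line='</a></td></tr><tr><td>stuff.txt (15.18 KB)</td><td>12/01/2015</td><td>Large things</td><td>158520312</td><td><a class="btn-down" download href="https://resource.com/stones">'

# Apply the substitutions one at a time and show the line after each.
step=$line
for cmd in 's/>$//' 's/href=/>/' 's/(<[^>]+>)+/~/g' 's/~[^~]+~//' 's/~[^~]+~/ /' 's/~/ /'
do
  step=$(printf '%s\n' "$step" | sed -E "$cmd")
  printf '%-20s -> %s\n' "$cmd" "$step"
done
```

After the third command the sample row has been reduced to ~stuff.txt (15.18 KB)~12/01/2015~Large things~158520312~"https://resource.com/stones", which makes it easier to see which delimiters the remaining three commands consume.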

Obviously, this leaves a LOT of room for improvement, and is in many ways stylistically repulsive, but hopefully it is a simple explanation of tricks you can hack into a maintainably useful solution to your issue.
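
As one possible tidy-up, the last three substitutions can be folded into a single one that keeps the second, fourth and fifth fields. This is a sketch under the assumption that every row has exactly the five cells shown in the sample; the variable names line and combined are hypothetical.

```shell
# One sample row copied from the input file.
line='</a></td></tr><tr><td>apples.pdf (281.16 MB)</td><td>21/04/2012</td><td>Large things like apples</td><td>299009564</td><td><a class="btn-down" download href="https://resource.com/apples">'

# Same first three steps, then one substitution that keeps fields 2 and 4
# (date, number), drops fields 1 and 3, and leaves the URL that follows.
combined=$(printf '%s\n' "$line" | sed -E '
  s/>$//
  s/href=/>/
  s/(<[^>]+>)+/~/g
  s/^~[^~]+~([^~]+)~[^~]+~([^~]+)~/\1 \2 /
')
printf '%s\n' "$combined"
```

The trade-off is that the single pattern is harder to tweak than the separate steps: a row with a missing or empty cell will not match and will be printed with its ~ delimiters still in place.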

Good luck.

Paul Hodges