I'm trying to filter an HTML file to extract only specific values. The file is an HTML report from MetaTrader, and I would like to extract only the output values table from it.

This is a sample of the HTML file (report2.html):

<tr align="right">
   <td nowrap colspan="3">Net profit:</td>
   <td nowrap><b>17.74</b></td>
   <td nowrap colspan="3">Balance Drawdown Absolute:</td>
   <td nowrap><b>0.97</b></td>
   <td nowrap colspan="3">Absolute equity drawdown:</td>
   <td nowrap colspan="2"><b>1.39</b></td>
</tr>
<tr align="right">
   <td nowrap colspan="3">Gross Profit:</td>
   <td nowrap><b>43.91</b></td>
   <td nowrap colspan="3">Balance Drawdown Maximal:</td>
   <td nowrap><b>6.72 (0.07%)</b></td>
   <td nowrap colspan="3">Equity Drawdown Maximal:</td>
   <td nowrap colspan="2"><b>8.02 (0.08%)</b></td>
</tr>
<tr align="right">
   <td nowrap colspan="3">Gross Loss:</td>
   <td nowrap><b>-26.17</b></td>
   <td nowrap colspan="3">Relative balance drawdown:</td>
   <td nowrap><b>0.07% (6.72)</b></td>
   <td nowrap colspan="3">Relative equity drawdown:</td>
   <td nowrap colspan="2"><b>0.08% (8.02)</b></td>
</tr>

If I use

grep --no-group-separator -A1 awdown report2.html | sed -n '/^$/!{s/<[^>]*>//g;p;}'

I get the following:

Balance Drawdown Absolute:
0.97
Absolute equity drawdown:
1.39
Balance Drawdown Maximal:
6.72 (0.07%)
Equity Drawdown Maximal:
8.02 (0.08%)
Relative balance drawdown:
0.07% (6.72)
Relative equity drawdown:
0.08% (8.02)

The problem is that I need each value on the same line as its label, separated by a tab, and I don't know how to do that. I also need the filename in the first column.

Output expected is something like this:

report2.html    Balance Drawdown Absolute:  0.97
report2.html    Absolute equity drawdown:   1.39
report2.html    Balance Drawdown Maximal:   6.72 (0.07%)
report2.html    Equity Drawdown Maximal:    8.02 (0.08%)
report2.html    Relative balance drawdown:  0.07% (6.72)
report2.html    Relative equity drawdown:   0.08% (8.02)

Can anyone help me achieve this output?

Thank you

Fabio
  • Possible duplicate of [Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms](https://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la) – Cyrus Jun 25 '19 at 12:21

2 Answers

Try the following:

grep --no-group-separator -A1 awdown report2.html | sed -n '/^$/!{s/<[^>]*>//g;p;}' | sed '$!N;s/\n//'

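I have simply appended another sed command to yours. It joins each label line with the value line that follows it by deleting the newline between them; note that it does not insert a tab or the file name.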
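
To spell out what the extra sed does and to get closer to the expected output, here is a minimal sketch. It assumes GNU grep and GNU sed (`--no-group-separator` and `\t` in the replacement are GNU extensions), and `join_pairs` is a hypothetical helper name, not something from the question. It also strips the leading indentation that remains once the tags are removed.

```shell
#!/bin/sh
# Sketch only: join_pairs is a hypothetical helper name; GNU grep/sed assumed.
#
# Pipeline stages:
#   1. grep   : keep each line containing "awdown" plus the line after it.
#   2. sed -n : on non-empty lines, strip the tags and the leading
#               whitespace, then print.
#   3. sed    : $!N      -> unless this is the last line, append the next
#                           input line to the pattern space (the two are
#                           separated by an embedded newline);
#               s/\n/\t/ -> turn that embedded newline into a tab.
#   4. while  : prefix every joined line with the file name and a tab.
join_pairs() {
    file=$1
    grep --no-group-separator -A1 awdown "$file" |
        sed -n '/^$/!{s/<[^>]*>//g;s/^[[:space:]]*//;p;}' |
        sed '$!N;s/\n/\t/' |
        while IFS= read -r line; do
            printf '%s\t%s\n' "$file" "$line"
        done
}
```

Calling `join_pairs report2.html` should print one `file<TAB>label<TAB>value` row per drawdown entry. If every label line ends with `:`, the join can also be keyed on that character instead of on line position, for example `sed '/:$/{N;s/\n/\t/;}'` with GNU sed.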

Rajesh G
  • Hello. It works, but the output still doesn't include the file name shown in the expected output. Could you please explain the sed expression you used? Also, can I do something similar keyed on the : at the end of the line? I mean, if every label line ends with the character :, can I match that character and pull the following line up onto the end of it? I'm not sure I'm being clear. Thank you – Fabio Jun 25 '19 at 20:50

Another alternative, using awk to join each pair of lines (separated by a space):

 grep --no-group-separator -A1 awdown report2.html | sed -n '/^$/!{s/<[^>]*>//g;p;}' | awk 'NR%2{printf "%s ",$0;next;}1'
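
If the tab separator and the file name from the expected output are needed, one possible variation is to let awk read the file directly, so that its built-in `FILENAME` variable is available. This is only a sketch: `drawdown_rows` and the awk function `clean` are hypothetical names, and it assumes, as in the sample, that each value sits on the line right after its label.

```shell
#!/bin/sh
# Sketch only: drawdown_rows and clean() are hypothetical names.
drawdown_rows() {
    awk '
        # Remove HTML tags, then leading and trailing whitespace.
        function clean(s) {
            gsub(/<[^>]*>/, "", s)
            gsub(/^[ \t\r]+|[ \t\r]+$/, "", s)
            return s
        }
        # A label line containing "awdown": its value is on the next line.
        /awdown/ {
            label = clean($0)
            if ((getline value) > 0)
                print FILENAME "\t" label "\t" clean(value)
        }
    ' "$@"
}
```

It can be called as `drawdown_rows report2.html`, or with several reports at once (`drawdown_rows *.html`), in which case each row starts with the name of the file it came from.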
Christina Jacob