Linux Bash: regex on an HTML table to get one row

Question

result=$(
wget -qO- 'http://www.kuchenpeter.at/mittagsmenue.html' |
sed -n '/<p>/,/<\/p>/p'
)
echo "$result"

I am trying to get the menu from this page.

I need 5 strings from the table (see here).

The bad thing about this page is that, as the HTML code below shows, the tags are really messed up.

<tr>
<td style="text-align: left; border-right: 1px solid #888;" valign="top">
    <p>
        <strong>
            <span style="font-size: 12px;">
                Puszta-Kotelett mit Pommes-frites
            </span>
        </strong>
    </p>
    <p>
        <span style="font-size: 12px;">
            &nbsp;
        </span>
    </p>
</td>
###########################################
<td style="text-align: left; border-right: 1px solid #888;" valign="top">
    <p>
        <span style="font-size: 12px;">
            <strong>
                Hühnergeschnetzeltes "Asia" mit Reis
            </strong>
        </span>
    </p>
    <p>
        &nbsp;
    </p>
</td>   
###########################################
<td style="text-align: left; border-right: 1px solid #888;" valign="top">
    <p>
        <span style="font-size: 12px;">
            <strong>
                <span style="font-size: 12px;">
                    <strong>
                        Tafelspitz mit Apfelkren, Schnittlauchsauce und Röstinchen
                    </strong>
                </span>
            </strong>
        </span>
    </p>
    <p>
        &nbsp;
    </p>
</td>
<td style="text-align: left; border-right: 1px solid #888;" valign="top">
    <p>
        <span style="font-size: 12px;">
            <strong>
                Puten-Picatta "Milanese" mit Salat
            </strong>
        </span>
    </p>
    <p>&nbsp;</p>
</td>
<td style="text-align: left;" valign="top">
    <p>
        <span style="font-size: 12px;">
            <strong>
                Gebratener Dorsch mit Gemüse und Petersilkartoffeln
            </strong>
        </span>
    </p>
    <p>
        <span style="font-size: 12px;">
            &nbsp;
        </span>
    </p>
</td>

You should use an HTML parser and make queries with e.g. XPath instead of using regex. — ssc-hrep3, Jan 30 '17 at 13:53
In your case, You need to strip the html tags; then it will be easy to extract the information you need. see [this](http://stackoverflow.com/questions/3790681/regular-expression-to-remove-html-tags) and [this](http://stackoverflow.com/questions/11229831/regular-expression-to-remove-html-tags-from-a-string) to know how to remove html tag using regex — Sourav Ghosh, Jan 30 '17 at 13:55
This is the answer you want: http://stackoverflow.com/a/1732454/1705337 — Morgoth, Jan 30 '17 at 14:01
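The tag-stripping idea from the comments can be sketched with plain sed. This is only a rough illustration under assumptions: the `html` variable is a cut-down, hypothetical copy of two cells from the question, `strip_tags` is a made-up helper name, and the approach only works while every tag opens and closes on the same line. A real HTML parser is more robust.

```shell
#!/usr/bin/env bash
# Hypothetical sample: a shortened copy of two cells from the question.
html='<td><p><strong><span style="font-size: 12px;">
Puszta-Kotelett mit Pommes-frites
</span></strong></p><p><span>&nbsp;</span></p></td>
<td><p><span><strong>
Puten-Picatta "Milanese" mit Salat
</strong></span></p><p>&nbsp;</p></td>'

# strip_tags is an illustrative helper name, not a standard tool.
# It removes tags and &nbsp; entities, trims whitespace, and drops
# the lines that end up empty.
strip_tags() {
  sed -e 's/<[^>]*>//g' \
      -e 's/&nbsp;//g' \
      -e 's/^[[:space:]]*//' \
      -e 's/[[:space:]]*$//' \
      -e '/^$/d'
}

menu=$(printf '%s\n' "$html" | strip_tags)
printf '%s\n' "$menu"
```

On the live page the same filter would presumably be fed from `wget -qO-`, but the surrounding markup would also leave other text behind, so the output would still need narrowing down to the menu row.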

score 2 · Answer 1 · answered Jan 30 '17 at 16:17

My Xidel can do it with pattern matching, which almost looks like a regex.

Everything in the row after "Tagessuppe":

xidel http://www.kuchenpeter.at/mittagsmenue.html -e "<tr>Tagessuppe</tr><tr><strong>{.}</strong>+</tr>"

Or in the third row:

xidel http://www.kuchenpeter.at/mittagsmenue.html -e '<div class="block"><tr/>{2}<tr><strong>{.}</strong>+</tr></div>'

BeniBela

Using XPath: `xidel file.html -q --xpath '//table/tbody/tr[3]/td/p[1]//text()'` – Casimir et Hippolyte Jan 30 '17 at 17:07
Thank you BeniBela and Casimir et Hippolyte. Xidel is really the best tool I have ever seen for this =) I now use it like this: `result1=$(xidel http://www.kuchenpeter.at/mittagsmenue.html --xpath '//table/tbody/tr[3]/td[1]/p[1]//text()')` For more background: I will use this to send myself the menu of the week via a Telegram bot =) – axi92 Jan 31 '17 at 06:21
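If the XPath query is run without the `td[1]` index, the five dishes should come back together. A minimal sketch of turning that into one numbered message could look like the following. It assumes the output arrives as one dish per line; the `result` value here is a stand-in built from the dish names in the question rather than real tool output, and `message` is just an illustrative variable name.

```shell
#!/usr/bin/env bash
# Stand-in for the query output: assumed to be one dish per line.
result='Puszta-Kotelett mit Pommes-frites
Hühnergeschnetzeltes "Asia" mit Reis
Tafelspitz mit Apfelkren, Schnittlauchsauce und Röstinchen
Puten-Picatta "Milanese" mit Salat
Gebratener Dorsch mit Gemüse und Petersilkartoffeln'

n=0
message=''
# Read the lines one by one and build a numbered list.
while IFS= read -r dish; do
  n=$((n + 1))
  message="${message}${n}. ${dish}
"
done <<EOF
$result
EOF

printf '%s' "$message"
```

The loop reads from a here-document rather than a pipe so that `n` and `message` are still set after the loop ends.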
