Parsing text with sed / awk

Question

I am trying to parse an HTML table to obtain its values. Here is the table:

    <tr>
            <th>CLI:</th>
            <td>0044123456789</td>
    </tr>

    <tr>
            <th>Call Type:</th>
            <td>New Enquiry</td>
    </tr>

    <tr>
            <th class=3D"nopaddingtop">Caller's Name:</th>
            <td class=3D"nopaddingtop">&nbsp;</td>
    </tr>

    <tr>
            <th class=3D"nopaddingmid"></th>
            <td class=3D"nopaddingmid">Mr</td>
    </tr>

    <tr>
            <th class=3D"nopaddingmid"></th>
            <td class=3D"nopaddingmid">Lee</td>
    </tr>

    <tr>
            <th class=3D"nopaddingbot"></th>
            <td class=3D"nopaddingbot">Butler</td>
    </tr>

I want to read the values associated with "CLI", "Call Type", and "Caller's Name" into separate variables using sed / awk.

For example:

cli="0044123456789"
call_type="New Enquiry"
caller_name="Mr Lee Butler"

How can I do this?

Many thanks, Neil.

If it's valid HTML, I recommend using an XML parser such as `xmllint`. – Cyrus, Nov 15 '14 at 18:59
I agree about using an XML parser, but it is not clear whether you want to find just `CLI` (etc.) or the associated value(s) (`0044123456789`). Please update your question rather than replying in a comment. Good luck. – shellter, Nov 15 '14 at 19:13

Gilles Quénot · Answer 1 · 2014-11-17T13:49:01.957

2

An example for the CLI value:

var=$(xmllint --html --xpath '//th[contains(., "CLI")]/../td/text()' file.html)
echo "$var"

For the part that spans multiple <tr> rows:

# Rows 4 to 6 of the sample hold the three parts of the caller's name.
for key in {4..6}; do
    xmllint \
        --html \
        --xpath "//th[contains(., 'CLI')]/../../tr[$key]/td/text()" file.html
    printf ' '
done
echo

Output:

Mr Lee Butler
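
Since the question asks about sed / awk, here is a rough awk-only sketch as an alternative to the xmllint commands above. It is not a real HTML parser: it assumes every <th> and <td> sits on its own line, exactly as in the sample, and that a row with an empty <th> continues the previous label. The helper name `get_field` is made up for this illustration.

```shell
# get_field LABEL FILE
# Prints the <td> text belonging to LABEL, joining continuation rows
# (rows whose <th> is empty) with single spaces.
# Assumption: one <th> or <td> element per line, as in the sample table.
get_field() {
    awk -v want="$1" '
        # Remove tags and &nbsp; entities, then trim surrounding whitespace.
        function text(s) {
            gsub(/<[^>]*>/, "", s)
            gsub(/&nbsp;/, " ", s)
            gsub(/^[ \t\r]+|[ \t\r]+$/, "", s)
            return s
        }
        /<th[ >]/ {
            t = text($0)
            sub(/:$/, "", t)            # drop the trailing colon of the label
            if (t != "") label = t      # an empty <th> keeps the previous label
            next
        }
        /<td[ >]/ && label == want {
            t = text($0)
            if (t != "") out = (out == "" ? t : out " " t)
        }
        END { print out }
    ' "$2"
}
```

With the sample saved as file.html, the variables from the question could then be filled with `cli=$(get_field "CLI" file.html)`, `call_type=$(get_field "Call Type" file.html)` and `caller_name=$(get_field "Caller's Name" file.html)`. If the markup ever puts several cells on one line, this sketch will break and a real parser is the safer choice.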

edited Nov 17 '14 at 13:49

answered Nov 15 '14 at 19:06


Hello - Thank you. This works for "cli" and "call_type" but not for "caller name". Any ideas? Many thanks, Neil. – Neil Reardon Nov 15 '14 at 19:32

It works, but in HTML `&nbsp;` is a space, so `xmllint` displays a space – Gilles Quénot Nov 15 '14 at 20:03
Hi, it only outputs "Mr". It does not output the rest of the name "Lee Butler". Regards, Neil. – Neil Reardon Nov 15 '14 at 20:26
Hi, yes, I know. Unfortunately I have no control over the quality of the source data. :-( It arrives in an email. I have to strip the headers and footers to get the "html". Then I have to parse the "html" in order to obtain the values. – Neil Reardon Nov 16 '14 at 11:22
Added a solution for multiple `<tr>` lines – Gilles Quénot Nov 16 '14 at 15:56
