0

I need to parse an HTML table:

<table>
  <tbody>
    <tr class="blue"><td>code</td><td>rate</td></tr>
    <tr class="gray_1"><td><span>USD</span><em>1</em></td><td>476.16</td></tr>
    <tr class="gray_2"><td><span>AUD</span><em>1</em></td><td>327.65</td></tr>
    <tr class="gray_9"><td><span>IRR</span><em>100</em></td><td>1.13</td></tr>
    <tr class="blue"><td>some comment</td><td>some comment</td></tr>
    <tr class="gray_1"><td><span>EUR</span><em>1</em></td><td>526.54</td></tr>
  </tbody>
</table>

and as a result to get:

+------+----+----+------+
|class |code|em  |value |
+------+----+----+------+
|gray_1|USD |   1| 11.11|
|gray_2|AUD |   1| 22.22|
|gray_9|IRR | 100| 33.33|
|gray_1|EUR |   1| 44.44|

I tried to do it like this:

with tbl as
(
    select xmltype('
        <table>
          <tbody>
            <tr class="blue"><td>code</td><td>rate</td></tr>
            <tr class="gray_1"><td><span>USD</span><em>1</em></td><td>476.16</td></tr>
            <tr class="gray_2"><td><span>AUD</span><em>1</em></td><td>327.65</td></tr>
            <tr class="gray_9"><td><span>IRR</span><em>100</em></td><td>1.13</td></tr>
            <tr class="blue"><td>some comment</td><td>some comment</td></tr>
            <tr class="gray_1"><td><span>EUR</span><em>1</em></td><td>526.54</td></tr>
          </tbody>
        </table>
    ') xml_data from dual
)
select
    *
from
    tbl,
    xmltable('//table/tbody/tr'
        passing tbl.xml_data
        columns
            data varchar2(128) path './td'
    )

But this causes an exception:

ORA-19279: XPTY0004 - XQuery dynamic type mismatch: expected singleton sequence - got multi-item sequence

What do I need to do to fix my code?

Alex Poole
  • 183,384
  • 11
  • 179
  • 318
Zurab
  • 1,411
  • 3
  • 17
  • 39
  • 1
    XML is not fully compatible with HTML so using XML methods on HTML is not recommended. – jdweng Sep 24 '19 at 08:53
  • @jdweng I realize that, but is there a better way to parse html with Oracle? I need to get and parse html page by schedule, so I do it using an oracle job which calls a pl/sql procedure. I just find certain piece if a page, then throw away tags like
    ,   etc. to get a code that could be parsed like an XML code
    – Zurab Sep 24 '19 at 09:53
  • Most people recommend HtmlAgilityPack. I really don't think it that bad to use xml, but every time I do it I get lots of negative points. – jdweng Sep 24 '19 at 10:11

1 Answers1

2

Your path is looking for a td under the tr; but there are two, hence the "got multi-item sequence" error you're seeing. You can reference each td tag by its position, as td[1] etc. It's very reliant on the table structure being as expected though.

With this specific example you can do:

with tbl as
(
    select xmltype('
        <table>
          <tbody>
            <tr class="blue"><td>code</td><td>rate</td></tr>
            <tr class="gray_1"><td><span>USD</span><em>1</em></td><td>476.16</td></tr>
            <tr class="gray_2"><td><span>AUD</span><em>1</em></td><td>327.65</td></tr>
            <tr class="gray_9"><td><span>IRR</span><em>100</em></td><td>1.13</td></tr>
            <tr class="blue"><td>some comment</td><td>some comment</td></tr>
            <tr class="gray_1"><td><span>EUR</span><em>1</em></td><td>526.54</td></tr>
          </tbody>
        </table>
    ') xml_data from dual
)
select
    x.class, x.currency, x.amount, to_number(x.rate) as rate
from
    tbl
cross join
    xmltable('/table/tbody/tr'
        passing tbl.xml_data
        columns
            class varchar2(10) path '@class',
            currency varchar2(3) path 'td[1]/span',
            amount number path 'td[1]/em',
            rate varchar2(50) path 'td[2]'
    ) x
where
    x.currency is not null

which gets:

CLASS      CUR     AMOUNT       RATE
---------- --- ---------- ----------
gray_1     USD          1     476.16
gray_2     AUD          1     327.65
gray_9     IRR        100       1.13
gray_1     EUR          1     526.54

It won't take much variation in the HTML to break it though. See this answer for some reasons it is fragile, and why it generally considered unwise to try to parse HTML as XML.

Alex Poole
  • 183,384
  • 11
  • 179
  • 318