How to get table data from html table in xml?

Question

I have an XML file that contains some HTML content such as bold text, paragraphs and tables. I have written a shell script that parses all of the HTML tags except tables. I'm using the XML package for R to parse the data.

<Root>
    <Title> This is dummy xml file </Title>
    <Content> This table summarises data in BMC format.
        <div class="abctable">
            <table border="1" cellspacing="0" cellpadding="0" width="100%"   class="coder">
                <tbody>
                    <tr>
                        <th width="50%">ABC</th>
                        <th width="50%">Weight status</th>
                    </tr>
                    <tr>
                        <td>are 18.5</td>
                        <td>arew</td>
                    </tr>
                    <tr>
                        <td>18.5 &amp;mdash; 24.9</td>
                        <td>rweq</td>
                    </tr>
                    <tr>
                        <td>25.0 &amp;mdash; 29.9</td>
                        <td>qewrte</td>
                    </tr>
                    <tr>
                        <td>30.0 and hwerqer</td>
                        <td>rwqe</td>
                    </tr>
                    <tr>
                        <td>40.0 rweq rweq</td>
                        <td>rqwe reqw</td>
                    </tr>
                </tbody>
            </table>
        </div>
    </Content>
    <Section>blah blah blah</Section>
</Root>

How can I parse the content of the table that is present in this XML?

juba · Answer 1 · 2013-01-25T10:22:25.790

2

Well, there is a function called readHTMLTable in the XML package that seems to do just what you need.

Here is a way to do it with the following XML file:

<Root>
    <Title> This is dummy xml file </Title>
    <Content>
      This table summarises data in BMC format.

     <div class="abctable">
     <table border="1" cellspacing="0" cellpadding="0" width="100%"   class="coder">
   <tbody>
   <tr>
       <th width="50%">ABC</th><th width="50%">Weight status</th>
   </tr>
   <tr>
       <td>are 18.5</td>
       <td>arew</td>
   </tr>
   <tr>
       <td>18.5 &amp;mdash; 24.9</td>
       <td>rweq</td>
   </tr>
   <tr>
       <td>25.0 &amp;mdash; 29.9</td>
       <td>qewrte</td>
   </tr>
   <tr>
       <td>30.0 and hwerqer</td>
       <td>rwqe</td>
   </tr>
   <tr>
       <td>40.0 rweq rweq</td>
       <td>rqwe reqw</td>
   </tr>
   </tbody>
  </table>
   </div>
 </Content>
 <Section>blah blah blah</Section>
 </Root>

If this is saved in a file called /tmp/data.xml, then you can use the following code:

library(XML)
doc <- htmlParse("/tmp/data.xml")         # lenient HTML parser
tableNodes <- getNodeSet(doc, "//table")  # all <table> elements
tb <- readHTMLTable(tableNodes[[1]])      # first table as a data frame

Which gives:

R> tb
                 V1            V2
1               ABC Weight status
2          are 18.5          arew
3 18.5 &mdash; 24.9          rweq
4 25.0 &mdash; 29.9        qewrte
5  30.0 and hwerqer          rwqe
6    40.0 rweq rweq     rqwe reqw
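
In the output above, the header cells end up as the first data row (the columns are named V1 and V2), and the cells still contain the literal text `&mdash;`. A minimal clean-up sketch follows, assuming you want the first row as column names and a plain hyphen in place of `&mdash;`. The data frame is rebuilt by hand from the printed output so that the sketch runs without the XML package or the file; neither clean-up step is part of the original answer.

```r
# Values copied from the output printed above.
tb <- data.frame(
  V1 = c("ABC", "are 18.5", "18.5 &mdash; 24.9", "25.0 &mdash; 29.9",
         "30.0 and hwerqer", "40.0 rweq rweq"),
  V2 = c("Weight status", "arew", "rweq", "qewrte", "rwqe", "rqwe reqw"),
  stringsAsFactors = FALSE
)

# Promote the first row to column names, then drop it from the data.
names(tb) <- unlist(tb[1, ], use.names = FALSE)
tb <- tb[-1, ]
rownames(tb) <- NULL

# Replace the literal "&mdash;" text (left over from "&amp;mdash;" in the
# XML) with a plain hyphen; fixed = TRUE turns off regex matching.
tb[[1]] <- gsub("&mdash;", "-", tb[[1]], fixed = TRUE)

print(tb)
```

If the real table has a proper header, passing `header = TRUE` to `readHTMLTable` may make the first step unnecessary.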

edited Jan 25 '13 at 10:22

answered Jan 25 '13 at 08:23

juba


2

If you look at the function's help page and its examples (`?readHTMLTable`), it seems that you just have to parse your XML, then select one `<table>` element and use `readHTMLTable` on it to get the values. All of this is done with functions from the XML package.
– juba Jan 25 '13 at 09:47
1

I made an attempt to parse the above XML file (data.xml): `doc = xmlTreeParse("data.xml", useInternal = TRUE, encoding="UTF-8")`, `top = xmlRoot(doc)`, `table<-top[[2]]`, `readHTMLTable[table]`, but I get this error message: `Error in readHTMLTable[table] : object of type 'closure' is not subsettable` – Manish Jan 25 '13 at 10:00
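
The error quoted in the comment above comes from the square brackets: `readHTMLTable[table]` tries to subset the function object instead of calling it. In addition, `top[[2]]` is a child of `<Root>`, not the `<table>` element. The sketch below shows one way the attempt could be corrected, assuming a shortened inline copy of the question's XML (the string `xml_text` is illustrative, not the original file).

```r
library(XML)

# Shortened inline copy of the question's XML, so no file is needed.
xml_text <- '<Root>
  <Title>This is dummy xml file</Title>
  <Content>This table summarises data in BMC format.
    <div class="abctable">
      <table class="coder">
        <tbody>
          <tr><th>ABC</th><th>Weight status</th></tr>
          <tr><td>are 18.5</td><td>arew</td></tr>
          <tr><td>30.0 and hwerqer</td><td>rwqe</td></tr>
        </tbody>
      </table>
    </div>
  </Content>
</Root>'

# Parse as XML with internal nodes, as in the comment's attempt.
doc <- xmlParse(xml_text, asText = TRUE)

# Select the <table> element itself with XPath instead of top[[2]].
table_node <- getNodeSet(doc, "//table")[[1]]

# Call the function with parentheses, not square brackets.
tb <- readHTMLTable(table_node, stringsAsFactors = FALSE)
print(tb)
```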
1

Updated my answer with a working example (almost copy/pasted from the help page, by the way). – juba Jan 25 '13 at 10:22
1

I have uploaded an XML file at http://textuploader.com/?p=6&id=ZBwog. With this file I cannot parse the two tables using your code. Can you please help me see where I am going wrong? – Manish Jan 29 '13 at 04:02
1

First, your XML file is not well-formed: most of your tags have been converted to entities. Second, please read the help page of `readHTMLTable` to understand how to parse several tables in a file. – juba Jan 29 '13 at 08:04
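
For the several-tables case mentioned in the comment above, one possible approach is to apply `readHTMLTable` to every node returned by the XPath query, giving one data frame per table. This is a sketch under assumptions: the inline string `html_text` and its two small tables are made up for illustration and are not the uploaded file.

```r
library(XML)

# Illustrative document with two tables (not the file from the comment).
html_text <- '<Root>
  <Content>
    <table>
      <tr><th>ABC</th><th>Weight status</th></tr>
      <tr><td>are 18.5</td><td>arew</td></tr>
      <tr><td>30.0 and hwerqer</td><td>rwqe</td></tr>
    </table>
    <table>
      <tr><th>Key</th><th>Value</th></tr>
      <tr><td>a</td><td>1</td></tr>
      <tr><td>b</td><td>2</td></tr>
    </table>
  </Content>
</Root>'

doc <- htmlParse(html_text, asText = TRUE)

# Every <table> element in the document, in document order.
table_nodes <- getNodeSet(doc, "//table")

# One data frame per table node.
tables <- lapply(table_nodes, readHTMLTable, stringsAsFactors = FALSE)

print(length(tables))
print(tables[[2]])
```

Individual tables can then be picked by position, for example `tables[[1]]`.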

score 1 · Answer 2 · edited May 23 '17 at 12:02

1

The best method for XML parsing would be to use XPath expressions.

Xpath Tutorial

Xpath and R

How to use XPath and R stackoverflow
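
As a rough illustration of the XPath route with the XML package, the sketch below pulls the header and data cells out with `xpathSApply` and assembles them into a data frame by hand. It assumes a regular table (the same number of cells in every row) and uses a shortened inline copy of the question's XML; the variable names are illustrative.

```r
library(XML)

# Shortened inline copy of the question's table.
xml_text <- '<Root>
  <Content>
    <table class="coder">
      <tbody>
        <tr><th>ABC</th><th>Weight status</th></tr>
        <tr><td>are 18.5</td><td>arew</td></tr>
        <tr><td>30.0 and hwerqer</td><td>rwqe</td></tr>
      </tbody>
    </table>
  </Content>
</Root>'

doc <- xmlParse(xml_text, asText = TRUE)

# One XPath expression for the header cells, one for the data cells.
headers <- xpathSApply(doc, "//table//tr/th", xmlValue)
cells   <- xpathSApply(doc, "//table//tr/td", xmlValue)

# Cells come back in document order, so fill the matrix row by row.
m <- matrix(cells, ncol = length(headers), byrow = TRUE,
            dimnames = list(NULL, headers))
tb <- as.data.frame(m, stringsAsFactors = FALSE)
print(tb)
```

This gives more control than `readHTMLTable` over which cells are selected, at the cost of handling the table shape yourself.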

edited May 23 '17 at 12:02

Community


answered Jan 25 '13 at 08:16

Billybonks

