
I have an XML file that contains some HTML content, such as bold text, paragraphs and tables. I have written a shell script that parses all of the HTML tags except tables. I'm using the XML package for R to parse the data.

<Root>
    <Title> This is dummy xml file </Title>
    <Content> This table summarises data in BMC format.
        <div class="abctable">
            <table border="1" cellspacing="0" cellpadding="0" width="100%"   class="coder">
                <tbody>
                    <tr>
                        <th width="50%">ABC</th>
                        <th width="50%">Weight status</th>
                    </tr>
                    <tr>
                        <td>are 18.5</td>
                        <td>arew</td>
                    </tr>
                    <tr>
                        <td>18.5 &amp;mdash; 24.9</td>
                        <td>rweq</td>
                    </tr>
                    <tr>
                        <td>25.0 &amp;mdash; 29.9</td>
                        <td>qewrte</td>
                    </tr>
                    <tr>
                        <td>30.0 and hwerqer</td>
                        <td>rwqe</td>
                    </tr>
                    <tr>
                        <td>40.0 rweq rweq</td>
                        <td>rqwe reqw</td>
                    </tr>
                </tbody>
            </table>
        </div>
    </Content>
    <Section>blah blah blah</Section>
</Root>

How can I parse the content of this table, which is embedded in the XML?

Brian Tompsett - 汤莱恩
Manish

2 Answers


There is a function called readHTMLTable in the XML package that seems to do just what you need.

Here is a way to do it with the following XML file:

<Root>
    <Title> This is dummy xml file </Title>
    <Content>
      This table summarises data in BMC format.

     <div class="abctable">
     <table border="1" cellspacing="0" cellpadding="0" width="100%"   class="coder">
   <tbody>
   <tr>
       <th width="50%">ABC</th><th width="50%">Weight status</th>
   </tr>
   <tr>
       <td>are 18.5</td>
       <td>arew</td>
   </tr>
   <tr>
       <td>18.5 &amp;mdash; 24.9</td>
       <td>rweq</td>
   </tr>
   <tr>
       <td>25.0 &amp;mdash; 29.9</td>
       <td>qewrte</td>
   </tr>
   <tr>
       <td>30.0 and hwerqer</td>
       <td>rwqe</td>
   </tr>
   <tr>
       <td>40.0 rweq rweq</td>
       <td>rqwe reqw</td>
   </tr>
   </tbody>
  </table>
     </div>
    </Content>
 <Section>blah blah blah</Section>
 </Root>

If this is saved in a file called /tmp/data.xml, you can use the following code:

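library(XML)
doc <- htmlParse("/tmp/data.xml")          # lenient HTML parser, tolerates the mixed markup
tableNodes <- getNodeSet(doc, "//table")   # XPath: every <table> node in the document
tb <- readHTMLTable(tableNodes[[1]])       # convert the first table to a data frame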

Which gives:

R> tb
                 V1            V2
1               ABC Weight status
2          are 18.5          arew
3 18.5 &mdash; 24.9          rweq
4 25.0 &mdash; 29.9        qewrte
5  30.0 and hwerqer          rwqe
6    40.0 rweq rweq     rqwe reqw
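
When a file contains more than one table, `readHTMLTable` can also be called on the parsed document itself rather than on a single node. The following is only a minimal sketch: the inline two-table document and the variable names are made up for illustration, and `stringsAsFactors = FALSE` is an optional argument added here to keep the cells as character strings.

```r
library(XML)

# Hypothetical stand-in for a real file: one document holding two tables.
html <- '<Root><Content>
  <table><tbody>
    <tr><th>ABC</th><th>Weight status</th></tr>
    <tr><td>are 18.5</td><td>arew</td></tr>
    <tr><td>30.0 and hwerqer</td><td>rwqe</td></tr>
  </tbody></table>
  <table><tbody>
    <tr><th>Key</th><th>Value</th></tr>
    <tr><td>a</td><td>1</td></tr>
    <tr><td>b</td><td>2</td></tr>
  </tbody></table>
</Content></Root>'

# Parse the string with the lenient HTML parser.
doc <- htmlParse(html, asText = TRUE)

# Called on the whole document, readHTMLTable returns a list
# with one element per <table> it finds.
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

length(tables)   # number of tables found
tables[[2]]      # the second table
```

Indexing the returned list keeps the code the same whether the file holds one table or many.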
juba
  • 2
    If you look at the command's help page and its examples (`?readHTMLTable`), it seems that you just have to parse your XML, then select one `<table>` element and use `readHTMLTable` on it to get the values. All of this is done with functions from the XML package.
    – juba Jan 25 '13 at 09:47
  • 1
    I made an attempt to parse the above XML file (data.xml): `doc = xmlTreeParse("data.xml", useInternal = TRUE, encoding="UTF-8")`, `top = xmlRoot(doc)`, `table<-top[[2]]`, `readHTMLTable[table]`, but I get this error message: `Error in readHTMLTable[table] : object of type 'closure' is not subsettable` – Manish Jan 25 '13 at 10:00
  • 1
    Updated my answer with a working example (almost copy/pasted from the help page, by the way). – juba Jan 25 '13 at 10:22
  • 1
    I have uploaded an XML file at http://textuploader.com/?p=6&id=ZBwog. With this file I cannot parse the two tables using your code. Can you please help me see where I am going wrong? – Manish Jan 29 '13 at 04:02
  • 1
    First, your XML file is not well-formed: most of your tags have been converted to entities. Second, please read the help page of `readHTMLTable` to understand how to parse several tables in a file. – juba Jan 29 '13 at 08:04

The best method for XML parsing would be to use XPath expressions.

Xpath Tutorial

Xpath and R

How to use XPath and R stackoverflow
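
As a rough illustration of the XPath approach with the XML package, the sketch below pulls the header cells and the data rows out of a table and assembles a data frame by hand. The inline document is a shortened, hypothetical version of the question's file, and the variable names are illustrative only.

```r
library(XML)

# Shortened, well-formed stand-in for the file in the question.
xml <- '<Root>
  <Content>
    <div class="abctable">
      <table class="coder">
        <tbody>
          <tr><th>ABC</th><th>Weight status</th></tr>
          <tr><td>are 18.5</td><td>arew</td></tr>
          <tr><td>30.0 and hwerqer</td><td>rwqe</td></tr>
        </tbody>
      </table>
    </div>
  </Content>
</Root>'

doc <- xmlParse(xml, asText = TRUE)

# Header cells: the text of every <th> under a <table>.
headers <- xpathSApply(doc, "//table//th", xmlValue)

# Data rows: every <tr> that has at least one <td> child.
rows <- getNodeSet(doc, "//table//tr[td]")

# For each row, collect the text of its <td> cells, then stack the rows.
cells <- do.call(rbind, lapply(rows, function(r) xpathSApply(r, "./td", xmlValue)))

tb <- as.data.frame(cells, stringsAsFactors = FALSE)
names(tb) <- headers
tb
```

This is more verbose than `readHTMLTable`, but it gives direct control over which cells are selected.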

Community
Billybonks