I need to convert an HTML list into a hierarchical structure in R (say, a tree). I tried using the data.tree package together with the XML package, but without much success.

The full HTML for the list is available here: https://biocyc.org/META/class-subs-instances?object=Pathways. It is very large, so I won't post it in full.

What I need is to keep the same structure as the list but convert it into a hierarchical object such as a tree. I was thinking about the data.tree package, since it can build a tree directly from path strings (i.e. foo/bar/something), for example with as.Node on a data frame that has a pathString column. Its ToDataFrameTypeCol function then goes the other way and converts a tree into a data frame with one column per level.
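
To make the path idea concrete, here is a minimal sketch of building a data.tree object from path strings. The two rows and the column name pathway_id are only illustrative (they are hand-typed from the example list, not parsed from the page), and ";" is used as the delimiter on the assumption that labels such as "Activation/Inactivation/Interconversion" must keep their "/" characters.

```r
library(data.tree)

# Illustrative rows only: path strings typed by hand, ";"-delimited
# because the root label itself contains "/".
paths <- data.frame(
  pathString = c(
    "Activation/Inactivation/Interconversion;Inactivation;Gibberellin Inactivation;gibberellin inactivation II (methylation)",
    "Activation/Inactivation/Interconversion;Inactivation;brassinosteroids inactivation"
  ),
  pathway_id = c("PWY-6477", "PWY-6546"),  # hypothetical column name
  stringsAsFactors = FALSE
)

# Build the tree; every other column becomes an attribute of the leaf node
tree <- as.Node(paths, pathDelimiter = ";")

# Collect the pathway IDs stored on the leaves
leaf_ids <- tree$Get("pathway_id", filterFun = isLeaf)

print(tree, "pathway_id")
```

Calling ToDataFrameTypeCol(tree) on the result should give back one column per level, which is the reverse direction of what is needed here.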

I would appreciate any ideas,

thanks in advance,

cheers,

Giovanni

Edits following MrFlick's comment

Here is an example of what I did and what I need to do. To keep things as simple as possible, I'll show only the very first part of the list, which should be enough:

<b><a href="/META/NEW-IMAGE?object=Activation-Inactivation-Interconversion">Activation/Inactivation/Interconversion</a></b>
<ul>
<li><b><a href="/META/NEW-IMAGE?object=Activation">Activation</a></b>
<ul>
<li><b><a href="/META/NEW-IMAGE?object=GLUCOSINOLATE-DEG">Glucosinolates Activation</a></b>
<ul>
<li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-6684">aromatic  glucosinolate activation</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-5267">glucosinolate activation</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWYQT-4476">indole glucosinolate activation (herbivore attack)</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWYQT-4477">indole glucosinolate activation (intact plant cell)</a>
</li></ul>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-6012-1">acyl carrier protein activation</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-4441">DIMBOA-glucoside activation</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-7321">ecdysteroid metabolism (arthropods)</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-7895">ethionamide activation</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-1822">indole-3-acetate activation I</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-1921">indole-3-acetate activation II</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-7896">isoniazid activation</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-5143">long-chain fatty acid activation</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-5340">sulfate activation for sulfonation</a>
</li></ul>
</li><li><b><a href="/META/NEW-IMAGE?object=Inactivation">Inactivation</a></b>
<ul>
<li><b><a href="/META/NEW-IMAGE?object=Gibberellin-Inactivation">Gibberellin Inactivation</a></b>
<ul>
<li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-102">gibberellin inactivation I (2β-hydroxylation)</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-6477">gibberellin inactivation II (methylation)</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-6494">gibberellin inactivation III (epoxidation)</a>
</li></ul>
</li><li><b><a href="/META/NEW-IMAGE?object=Indole-3-Acetate-Inactivation">Indole-3-acetate Inactivation</a></b>
<ul>
<li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-1961">indole-3-acetate inactivation I</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-1962">indole-3-acetate inactivation II</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-1981">indole-3-acetate inactivation III</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-2021">indole-3-acetate inactivation IV</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-5788">indole-3-acetate inactivation V</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-5797">indole-3-acetate inactivation VI</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-5811">indole-3-acetate inactivation VII</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-5784">indole-3-acetate inactivation VIII</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-6219">indole-3-acetate inactivation VIII</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-1741">indole-3-acetate inactivation IX</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-1782">superpathway of indole-3-acetate conjugate biosynthesis</a>
</li></ul>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-6546">brassinosteroids inactivation</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-7321">ecdysteroid metabolism (arthropods)</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-7859">jasmonoyl-L-isoleucine inactivation</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-6297">tuberonate glucoside biosynthesis</a>
</li></ul>
</li><li><b><a href="/META/NEW-IMAGE?object=Interconversion">Interconversions</a></b>
<ul>
<li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-5272">abscisic acid degradation by glucosylation</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-6012">acyl carrier protein metabolism</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-5926">afrormosin conjugates interconversion</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=ARGORNPROST-PWY">arginine, ornithine and proline interconversion</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-2861">biochanin A conjugates interconversion</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-7057">cichoriin interconversion</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-2343">daidzein conjugates interconversion</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-7056">daphnin interconversion</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-7949">diadinoxanthin and diatoxanthin interconversion</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-2904">formononetin conjugates interconversion</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-2345">genistein conjugates interconversion</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-5835">geranyl acetate biosynthesis</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-801">homocysteine and cysteine interconversion</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-2701">maackiain conjugates interconversion</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-2561">medicarpin conjugates interconversion</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-6303">methyl indole-3-acetate interconversion</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-6972">oleandomycin activation/inactivation</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-7075">phenylethyl acetate biosynthesis</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-5114">UDP-sugars interconversion</a>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-5945">violaxanthin, antheraxanthin and zeaxanthin interconversion</a>
</li></ul>
</li></ul>

I'm trying to convert the list into a hierarchical structure in which outer elements are the parents of inner elements. For example, the element "gibberellin inactivation II (methylation)" should be placed under the parent node "Gibberellin Inactivation", which in turn should sit under "Inactivation", which sits under the top-level parent "Activation/Inactivation/Interconversion". In other words, I need a string like "Activation/Inactivation/Interconversion;Inactivation;Gibberellin Inactivation;gibberellin inactivation II (methylation)" for each node. To complicate things slightly, I also need the "object=XXXX" value from the "href" of each leaf of the tree, so the complete string should look like: "Activation/Inactivation/Interconversion;Inactivation;Gibberellin Inactivation;gibberellin inactivation II (methylation);PWY-6477".

I have tried to parse the html file this way:

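library(data.tree)
library(XML)

# Read the saved fragment and parse it as HTML
html <- readLines("html_example.html")
doc <- htmlParse(html, asText = TRUE)

# Convert the parsed document to a nested list, then to a data.tree Node
html.list <- xmlToList(doc)
node <- as.Node(html.list)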
But the result is difficult to work with...
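
For reference, one possible way to get the ";"-separated strings described above is to query the parsed document with XPath instead of converting it to a list. This is only a sketch under assumptions: it relies on the pattern in the fragment above (each category label is a <b> immediately before its <ul>, and each leaf is an <a> directly inside an <li>), it uses a shortened stand-in for the fragment, and it has not been checked against the full BioCyc page. The helper name leaf_path is made up for the example.

```r
library(XML)

# Shortened stand-in for the fragment above (same nesting pattern)
html <- '<b><a href="/META/NEW-IMAGE?object=Activation-Inactivation-Interconversion">Activation/Inactivation/Interconversion</a></b>
<ul>
<li><b><a href="/META/NEW-IMAGE?object=Inactivation">Inactivation</a></b>
<ul>
<li><b><a href="/META/NEW-IMAGE?object=Gibberellin-Inactivation">Gibberellin Inactivation</a></b>
<ul>
<li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-6477">gibberellin inactivation II (methylation)</a>
</li></ul>
</li><li><a href="/META/NEW-IMAGE?type=PATHWAY&amp;object=PWY-6546">brassinosteroids inactivation</a>
</li></ul>
</li></ul>'

doc <- htmlParse(html, asText = TRUE)

# Leaves: <a> elements directly inside an <li> (category links are wrapped in <b>)
leaves <- getNodeSet(doc, "//li/a")

# Hypothetical helper: build "parent;...;leaf;object-id" for one leaf <a>
leaf_path <- function(a) {
  # Each enclosing <ul> is preceded by the <b> label of its category
  parents <- unlist(xpathSApply(a, "ancestor::ul/preceding-sibling::b[1]/a", xmlValue))
  # Pull the value of the "object" query parameter out of the href
  id <- sub(".*[?&]object=([^&]*).*", "\\1", xmlGetAttr(a, "href"))
  paste(c(parents, xmlValue(a), id), collapse = ";")
}

paths <- vapply(leaves, leaf_path, character(1))
print(paths)
```

If the ID is kept in a separate column rather than appended to the string, the resulting data frame could be passed to as.Node with pathDelimiter = ";".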

Thanks again!

ciao,

Giovanni

  • When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick May 22 '18 at 15:01
  • My fault, it was a long list so I was reluctant to add the full example. I have added a short example to my question. I hope that helps. – Giovanni May 23 '18 at 12:49