3

I was trying to build a parse tree for the HTML document below, but couldn't work it out. I want to see what the tree structure looks like. Can anyone help me here?

# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

EDIT

Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\Users\matt>easy_install ete2
Searching for ete2
Reading http://pypi.python.org/simple/ete2/
Reading http://ete.cgenomics.org
Reading http://ete.cgenomics.org/releases/ete2/
Reading http://ete.cgenomics.org/releases/ete2
Best match: ete2 2.1rev539
Downloading http://ete.cgenomics.org/releases/ete2/ete2-2.1rev539.tar.gz
Processing ete2-2.1rev539.tar.gz
Running ete2-2.1rev539\setup.py -q bdist_egg --dist-dir c:\users\arupra~1\appdat
a\local\temp\easy_install-sypg3x\ete2-2.1rev539\egg-dist-tmp-zemohm

Installing ETE (A python Environment for Tree Exploration).

Checking dependencies...
numpy cannot be found in your python installation.
Numpy is required for the ArrayTable and ClusterTree classes.
MySQLdb cannot be found in your python installation.
MySQLdb is required for the PhylomeDB access API.
PyQt4 cannot be found in your python installation.
PyQt4 is required for tree visualization and image rendering.
lxml cannot be found in your python installation.
lxml is required from Nexml and Phyloxml support.

However, you can still install ETE without such functionality.
Do you want to continue with the installation anyway? [y,n]y
Your installation ID is: d33ba3b425728e95c47cdd98acda202f
warning: no files found matching '*' under directory '.'
warning: no files found matching '*.*' under directory '.'
warning: manifest_maker: MANIFEST.in, line 4: path 'doc/ete_guide/' cannot end w
ith '/'

warning: manifest_maker: MANIFEST.in, line 5: path 'doc/' cannot end with '/'

warning: no previously-included files matching '*.pyc' found under directory '.'

zip_safe flag not set; analyzing archive contents...
Adding ete2 2.1rev539 to easy-install.pth file
Installing ete2 script to C:\Python27\Scripts

Installed c:\python27\lib\site-packages\ete2-2.1rev539-py2.7.egg
Processing dependencies for ete2
Finished processing dependencies for ete2
fransua
Arup Rakshit
  • @Oded, I guess with python:) – allergic Jan 05 '13 at 13:03
  • @Oded I just want to see what its tree structure looks like. Basically I am using a Python package that processes an `html` doc as a parse tree, so I want to see the structure of that tree. If you can help with that, I would be grateful! – Arup Rakshit Jan 05 '13 at 13:04
  • I can't, as I am not a Python guy (now you know why you should tag the question with the language). It is not clear to me how you want to see the parse tree either - you need to expand on that too. – Oded Jan 05 '13 at 13:09
  • @Oded I just want to see how it looks as a `tree like structure`, that's it. It doesn't need to be a Python-specific tree; Python also produces it the standard way. It should be a top-down parse tree. – Arup Rakshit Jan 05 '13 at 13:13
  • Why not _edit_ the question and add these details to it? – Oded Jan 05 '13 at 13:15
  • I thought I already explained - I don't know python. Someone else who does may be able to help. But you really should edit the question to have all relevant information. – Oded Jan 05 '13 at 13:35
  • Do you want a parsed tree data structure or a visualization of the tree? – vivek Jan 06 '13 at 10:57
  • @vivek I want to visualize the tree. – Arup Rakshit Jan 06 '13 at 11:16

2 Answers

12

This answer comes a bit late, but still I'd like to share it:

I used networkx and lxml (which I found allows much more elegant traversal of the DOM tree). However, the tree layout depends on graphviz and pygraphviz being installed; networkx on its own would just scatter the nodes across the canvas. The code is longer than strictly required because I draw the labels myself to have them boxed (networkx can draw labels, but at the time of writing it didn't pass the bbox keyword on to matplotlib).

import networkx as nx
from lxml import html
import matplotlib.pyplot as plt
from networkx.drawing.nx_agraph import graphviz_layout

raw = "...your raw html"

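def traverse(parent, graph, labels):
    # Record this element's tag name, then recurse into its children.
    labels[parent] = parent.tag
    for node in parent:  # iterating an lxml element yields its child elements
        graph.add_edge(parent, node)
        traverse(node, graph, labels)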

G = nx.DiGraph()
labels = {}     # needed to map from node to tag
html_tag = html.document_fromstring(raw)
traverse(html_tag, G, labels)

pos = graphviz_layout(G, prog='dot')

label_props = {'size': 16,
               'color': 'black',
               'weight': 'bold',
               'horizontalalignment': 'center',
               'verticalalignment': 'center',
               'clip_on': True}
bbox_props = {'boxstyle': "round, pad=0.2",
              'fc': "grey",
              'ec': "b",
              'lw': 1.5}

nx.draw_networkx_edges(G, pos, arrows=True)
ax = plt.gca()

for node, label in labels.items():
    x, y = pos[node]
    ax.text(x, y, label,
            bbox=bbox_props,
            **label_props)

ax.xaxis.set_visible(False)
ax.yaxis.set_visible(False)
plt.show()

Changes to the code if you prefer (or have) to use BeautifulSoup:

I'm no expert and have only just looked at BS4 for the first time, but it works:

#from lxml import html
from bs4 import BeautifulSoup
from bs4.element import NavigableString

...

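def traverse(parent, graph, labels):
    # id() is used as the node key: bs4 tags with identical markup can
    # share a hash, which would merge distinct elements into one node.
    labels[id(parent)] = parent.name
    for node in parent.children:
        if isinstance(node, NavigableString):
            continue  # skip text nodes, keep only tags
        graph.add_edge(id(parent), id(node))
        traverse(node, graph, labels)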

...

#html_tag = html.document_fromstring(raw)
soup = BeautifulSoup(raw, "html.parser")  # name the parser explicitly
html_tag = soup.find(True)  # first tag in the document (normally <html>)

...
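
If installing graphviz and pygraphviz is a hurdle (as it seems to be on Windows), a plain-text rendering may already be enough to see the structure. What follows is only a minimal sketch using Python 3's standard-library `html.parser` rather than lxml or BS4; `TreePrinter`, `text_tree`, `VOID` and the shortened sample document are names I made up for illustration, not part of any of the libraries above.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are never pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TreePrinter(HTMLParser):
    """Hypothetical helper: collects one indented line per element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open element names
        self.lines = []      # one output line per element

    def handle_starttag(self, tag, attrs):
        # Indentation depth equals the number of open ancestors.
        self.lines.append("    " * len(self.open_tags) + tag)
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close up to the nearest matching open tag; ignore stray end tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass


def text_tree(raw):
    parser = TreePrinter()
    parser.feed(raw)
    parser.close()
    return "\n".join(parser.lines)


# Shortened stand-in for the document in the question (an assumption).
sample = ("<html><head><title>T</title></head><body>"
          "<p class='title'><b>x</b></p>"
          "<p class='story'><a>Elsie</a>, <a>Lacie</a><br></p>"
          "</body></html>")
print(text_tree(sample))
```

Text nodes are ignored here, just as in the `traverse` functions above, so only the element structure is shown.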
Jeril
tzelleke
  • Which `python packages` do I need to install, and can I use `easy_install` for them? – Arup Rakshit Jan 06 '13 at 19:49
  • You need networkx, matplotlib, graphviz, pygraphviz, lxml. I installed all of them easily from the package manager on Ubuntu 12.10. – tzelleke Jan 06 '13 at 19:52
  • You can download graphviz as a Windows binary - I just checked that. But lxml you would have to build from sources and provide the required dependencies (libxml2, libxslt). Building from sources and linking is inherently difficult on Windows... So, honestly, I would suggest that you skip lxml. It is just needed here for parsing and traversing the HTML. Instead you could use BeautifulSoup. The rest should be available via `pip install` or `easy_install`. You would also need numpy. – tzelleke Jan 06 '13 at 20:07
  • Yeah, I have already installed BS4. So are any code changes needed for it? – Arup Rakshit Jan 06 '13 at 20:09
  • Well yeah, the `def traverse(...)` needs to change to use the BS4 API. I may look into that, but I haven't used it so far... – tzelleke Jan 06 '13 at 20:13
  • Okay, please help me here. I have also never used this programming language; it's the first time for me. So if you give the full code, I can take your concept and move ahead! Your help is and will always be a kind of `god gift` for me! :) – Arup Rakshit Jan 06 '13 at 20:15
  • @PythonLikeYOU I have updated my answer to make it work with BS4. – tzelleke Jan 06 '13 at 21:19
  • Thanks for your time, but I'm confused about how to use your first and second snippets together. – Arup Rakshit Jan 06 '13 at 22:14
7

Python modules:
1. ETE, but it requires Newick format data.
2. GraphViz + pydot. See this SO answer.
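
For the GraphViz route, the input is just DOT text, so one possible sketch is to write that text directly. This version uses only Python 3's standard-library `html.parser` and does not use pydot at all; `DotBuilder`, `html_to_dot` and `VOID` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they become leaves and are never pushed.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class DotBuilder(HTMLParser):
    """Hypothetical helper: records elements as numbered nodes and edges."""

    def __init__(self):
        super().__init__()
        self.stack = []  # (node_id, tag) for each currently open element
        self.nodes = []  # (node_id, tag)
        self.edges = []  # (parent_id, child_id)

    def handle_starttag(self, tag, attrs):
        node_id = len(self.nodes)
        self.nodes.append((node_id, tag))
        if self.stack:
            self.edges.append((self.stack[-1][0], node_id))
        if tag not in VOID:
            self.stack.append((node_id, tag))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        if any(open_tag == tag for _, open_tag in self.stack):
            while self.stack.pop()[1] != tag:
                pass


def html_to_dot(raw):
    builder = DotBuilder()
    builder.feed(raw)
    builder.close()
    lines = ["digraph dom {", "    node [shape=box];"]
    lines += ['    n%d [label="%s"];' % node for node in builder.nodes]
    lines += ["    n%d -> n%d;" % edge for edge in builder.edges]
    lines.append("}")
    return "\n".join(lines)


dot_source = html_to_dot("<html><body><p>hi</p><p>x<br></p></body></html>")
print(dot_source)
```

Saved to a file such as `dom.dot` (the file name is an assumption), the output should be renderable with the GraphViz command line, for example `dot -Tpng dom.dot -o dom.png`.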

Javascript:
The amazing d3 TreeLayout which uses JSON format.
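
For the d3 route, the tree has to be turned into nested JSON first. A rough sketch, again with only the Python 3 standard library: it produces the `name`/`children` shape that d3's hierarchy examples commonly use (`children` being the property d3 reads by default, `name` just a convention). `HierarchyBuilder` and `html_to_hierarchy` are hypothetical names for illustration.

```python
import json
from html.parser import HTMLParser

# Elements without a closing tag; they become leaves and are never pushed.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class HierarchyBuilder(HTMLParser):
    """Hypothetical helper: builds nested {"name", "children"} dicts."""

    def __init__(self):
        super().__init__()
        self.root = None
        self.stack = []  # dicts for the currently open elements

    def handle_starttag(self, tag, attrs):
        node = {"name": tag, "children": []}
        if self.stack:
            self.stack[-1]["children"].append(node)
        elif self.root is None:
            self.root = node  # the first element seen becomes the root
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        if any(node["name"] == tag for node in self.stack):
            while self.stack.pop()["name"] != tag:
                pass


def html_to_hierarchy(raw):
    builder = HierarchyBuilder()
    builder.feed(raw)
    builder.close()
    return builder.root


tree = html_to_hierarchy("<html><head></head><body><p></p></body></html>")
print(json.dumps(tree, indent=2))
```

The printed JSON can then be loaded on the JavaScript side and handed to the tree layout.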

If you're using ETE then you'll need to convert html to newick format. Here's a small example I made:

from lxml import html
from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib import urlopen


def getStringFromNode(node):
    # Customize this according to
    # your requirements.
    node_string = node.tag
    if node.get('id'):
        node_string += '-' + node.get('id')
    if node.get('class'):
        node_string += '-' + node.get('class')
    return node_string


def xmlToNewick(node):
    node_string = getStringFromNode(node)
    nwk_children = []
    for child in node.iterchildren():
        nwk_children.append(xmlToNewick(child))
    if nwk_children:
        return "(%s)%s" % (','.join(nwk_children), node_string)
    else:
        return node_string


def main():
    html_page = html.fromstring(urlopen('http://www.google.co.in').read())
    newick_page = xmlToNewick(html_page)
    return newick_page

main()

Output (http://www.google.co.in in newick format):

'((meta,title,script,style,style,script)head,(script,textarea-csi,(((b-gb1,a-gb1,a-gb1,a-gb1,a-gb1,a-gb1,a-gb1,a-gb1,(u)a-gb1)nobr)div-gbar,((span-gbn-gbi,span-gbf-gbf,span-gbe,a-gb4,a-gb4,a-gb_70-gb4)nobr)div-guser,div-gbh,div-gbh)div-mngb,(br-lgpd,(((div)div-hplogo)div,br)div-lga,(((td,(input,input,input,(input-lst)div-ds,br,((input-lsb)span-lsbb)span-ds,((input-lsb)span-lsbb)span-ds)td,(a,a)td-fl sblc)tr)table,input-gbv)form,div-gac_scont,(br,((a,a,a,a,a,a,a,a,a)font-addlang,br,br)div-als)div,(((a,a,a,a,a-fehl)div-fll)div,(a)p)span-footer)center,div-xjsd,(script)div-xjsi,script)body)html'

After that you can use ETE as shown in their examples.
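
As a rough idea of what that could look like: the sketch below assumes the newer `ete3` package (the question installed `ete2`, which as far as I know exposes the same `Tree` class) and a small hand-written Newick string in the same shape as the output above. `finish_newick` is a hypothetical helper; it adds the trailing semicolon that Newick strings normally end with and that the output above lacks.

```python
# Small hand-written stand-in for the converter's output (an assumption).
newick = "((title)head,((b)p,(a,a,a)p,p)body)html"


def finish_newick(text):
    """Hypothetical helper: make sure the string ends with ';'."""
    return text if text.endswith(";") else text + ";"


try:
    from ete3 import Tree  # assumption: ete3 is installed
except ImportError:
    Tree = None  # without ETE, only the string preparation runs

if Tree is not None:
    # format=1 tells ETE to read the names of internal nodes as well.
    t = Tree(finish_newick(newick), format=1)
    print(t.get_ascii(show_internal=True))
```

Labels containing spaces (such as `td-fl sblc` in the output above) may need cleaning up in `getStringFromNode` before ETE accepts them.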

Hope that helps.

Community
vivek