
There are some packages for parsing a DOM tree from XML content, such as https://docs.python.org/2/library/xml.dom.minidom.html.

But I don't want to target XML, only the HTML content of web pages.

from htmldom import htmldom
dom = htmldom.HtmlDom( "http://www.yahoo.com" ).createDom()
# Find all the links present on the page and print their "href" values
a = dom.find( "a" )
for link in a:
    print( link.attr( "href" ) )

But for this I am getting the following error:

Error while reading url: http://www.yahoo.com
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/htmldom/htmldom.py", line 333, in createDom
    raise Exception
Exception

I have already checked BeautifulSoup, but it is not what I want. BeautifulSoup works only on static HTML pages; if the page content is loaded dynamically using JavaScript, it fails. I don't want to select elements using getElementByClassName and similar methods, but rather navigate by index, something like dom.children(0).children(1).

So is there any way, for example using a headless browser or Selenium, that I can parse the entire DOM tree structure and, walking through children and sub-children, access a target element?

Artjom B.
user2129623

2 Answers


The Python Selenium API provides you with everything you might need. You can start with

html = driver.find_element_by_tag_name("html")

or

body = driver.find_element_by_tag_name("body")

and then go from there with

body.find_element_by_xpath('./*[' + str(x) + ']')

which would be equivalent to the "body.children(x-1)" style you describe (note that XPath indices are 1-based, and the leading "./" makes the search relative to the element rather than the document root). You don't need BeautifulSoup or any other DOM traversal library on top of that, but you certainly can use one by taking the page source and letting another library such as BeautifulSoup parse it:

soup = BeautifulSoup(driver.page_source, "html.parser")
soup.html.contents[0]  # .contents is a subscriptable list; .children is a generator
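As a hedged sketch of that last step (the sample markup below is mine, standing in for driver.page_source), index-based child traversal with BeautifulSoup's .contents lists looks like this. Whitespace between tags shows up as text nodes in .contents, so you usually want to filter down to element nodes first:

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for driver.page_source.
html = """
<html>
  <body>
    <div id="nav"><a href="/home">Home</a></div>
    <div id="main"><p>Hello</p></div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
body = soup.html.find("body")

# Keep only element children; whitespace text nodes have name == None.
children = [c for c in body.contents if c.name]

first_div = children[0]        # <div id="nav">
link = first_div.contents[0]   # <a href="/home">
print(link["href"])            # /home
```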
Artjom B.

Yes, but it's not going to be simple enough to include all the code in an SO post. You're on the right track, though.

Basically you're going to need to use a headless renderer of your choice (e.g. Selenium) to download all the resources and execute the javascript. There's really no use reinventing the wheel there.

Then you'll need to write the HTML from the headless renderer out to a file on the page-ready event (every headless browser I've worked with offers this ability). At that point you can use BeautifulSoup over that file to navigate the DOM. BeautifulSoup does support the child-based traversal you want: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#going-down
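A minimal sketch of that pipeline, assuming Selenium with a Chrome driver (the file name, the sample markup, and the child() helper are illustrative, not part of either library's API). The browser step is shown but commented out so the traversal part stands alone:

```python
from bs4 import BeautifulSoup

def child(node, i):
    """Return the i-th element child, skipping whitespace text nodes."""
    return [c for c in node.contents if c.name][i]

# Step 1 (not run here): render the page and dump the final HTML.
# from selenium import webdriver
# driver = webdriver.Chrome()
# driver.get("http://www.yahoo.com")
# with open("rendered.html", "w") as f:
#     f.write(driver.page_source)  # HTML after JavaScript has executed
# driver.quit()

# Step 2: parse the dumped HTML and walk it purely by child index.
rendered = "<html><body><div><a href='/a'>x</a></div><p>y</p></body></html>"
soup = BeautifulSoup(rendered, "html.parser")
body = child(soup.html, 0)   # <body>
print(child(body, 1).text)   # text of the second child of <body>, the <p>
```

Writing the rendered source to disk first, rather than parsing driver.page_source directly, simply decouples the slow browser step from repeated parsing runs.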

0x24a537r9