There are packages for parsing a DOM tree from XML content, such as https://docs.python.org/2/library/xml.dom.minidom.html.
But I don't want to target XML, only the content of HTML web pages. I tried the htmldom package:
from htmldom import htmldom
dom = htmldom.HtmlDom( "http://www.yahoo.com" ).createDom()
# Find all the links present on the page and print each one's "href" value
a = dom.find( "a" )
for link in a:
    print( link.attr( "href" ) )
But this gives me the following error:
Error while reading url: http://www.yahoo.com
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/htmldom/htmldom.py", line 333, in createDom
raise Exception
Exception
I have already looked at BeautifulSoup, but it is not what I want. BeautifulSoup only works on static HTML; if the page content is loaded dynamically with JavaScript, it fails. I also don't want to locate elements with getElementsByClassName and similar lookups. Instead I want to walk the tree by position, something like dom.children(0).children(1).
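To make the positional access described above concrete, here is a minimal sketch of what I mean, assuming Python 3 and only the standard library's html.parser. The names Node, TreeBuilder, build_dom and the children(index) method are hypothetical, made up for illustration, and are not htmldom's API. It works on a static HTML string only; for a JavaScript-driven page the string would first have to come from something that actually executes the scripts, which is the part I am missing.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical minimal element node: a tag, its attributes, its child elements."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.kids = []

    def children(self, index):
        # Positional access, mirroring the dom.children(0).children(1) idea.
        return self.kids[index]


class TreeBuilder(HTMLParser):
    """Builds a simple element tree; text nodes are ignored in this sketch."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].kids.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open element, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def build_dom(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


html = ("<html><head><title>t</title></head>"
        "<body><p>x</p><a href='/next'>n</a></body></html>")
dom = build_dom(html)

# document -> html -> body -> second child element of body
target = dom.children(0).children(1).children(1)
print(target.tag, target.attrs.get("href"))
```

What I am looking for is this kind of index-based walk, but over the DOM as it exists after the page's JavaScript has run.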
So is there any way, for example with a headless browser or Selenium, to parse the entire DOM tree and reach a target element by walking through its children and grandchildren?