
I'm using the lxml library in Python for HTML and XML parsing. I set up a parser like

import lxml.etree

parser = lxml.etree.HTMLParser()

and build a tree from the HTML source (a string):

tree = lxml.etree.fromstring(html, parser).getroottree()  # returns an XML tree

According to the lxml docs, this should return a DOM-like tree (an ElementTree).

I want to find certain elements by their tag names, such as "a", "div", "span", etc.

How can I get the XPaths of all matching elements, given their tag names?

EDIT: I am actually developing an AJAX crawler, so I need Selenium to click certain elements that can change the DOM state. I send the HTML source to lxml for analysis.

For example, given a default list of tag names like

["a", "button", "li", "nav", "ol", "span", "ul", "header", "footer", "section"]

I need to get the XPaths of the above elements so that I can pass them to Selenium for clicks and other event triggers. A rough sketch of what I'm after is below.
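Something like the following is what I have in mind (a minimal sketch; the sample HTML string is made up, and I'm assuming lxml's iter() and getpath() behave as documented):

import lxml.etree

html = "<html><body><a href='#'>x</a><ul><li>one</li><li>two</li></ul></body></html>"
parser = lxml.etree.HTMLParser()
tree = lxml.etree.fromstring(html, parser).getroottree()

tags = ["a", "button", "li", "nav", "ol", "span", "ul", "header", "footer", "section"]
for tag in tags:
    for element in tree.iter(tag):
        # getpath() returns an absolute XPath such as /html/body/ul/li[1]
        print(tree.getpath(element))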

  • Could you provide a simple example: the input (an HTML fragment) and the desired output? Thanks. – alecxe Jun 02 '14 at 16:00

2 Answers


You don't really need a separate parser; Selenium itself is pretty powerful when it comes to locating elements:

from selenium import webdriver

browser = webdriver.Firefox()
browser.get('url_goes_here')

list_of_elements = ["a", "button", "li", "nav", "ol", "span", "ul", "header", "footer", "section"]
for tag_name in list_of_elements:
    # find_elements_by_tag_name() returns every matching WebElement on the page
    for element in browser.find_elements_by_tag_name(tag_name):
        print(element)
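If you do end up computing XPaths on the lxml side, you can still hand them back to Selenium for the clicks. A minimal sketch (the XPath string here is a hypothetical value such as lxml's getpath() might produce, and browser is the webdriver instance from above):

xpath = '/html/body/ul/li[1]'  # hypothetical XPath, e.g. from lxml's getpath()
element = browser.find_element_by_xpath(xpath)
element.click()  # click through the webdriver to trigger any DOM changes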
alecxe
  • But Selenium is what makes it slow. Do you know a method to do this using lxml? Thanks. – user3623152 Jun 02 '14 at 17:47
  • @user3623152 I think you have to stick with `selenium` since you need webdriver element instances for further actions - you said you would need to click and trigger events on the elements found. – alecxe Jun 02 '14 at 17:49

I've always found that using "Beautiful Soup" makes this sort of thing much easier.

http://lxml.de/elementsoup.html

There are already a number of similar questions here, try:

retrieve links from web page using python and BeautifulSoup
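For instance, a minimal sketch with bs4 (the sample HTML is made up; note that Beautiful Soup can delegate the actual parsing to lxml):

from bs4 import BeautifulSoup

html = "<html><body><a href='#'>link</a><span>text</span></body></html>"
soup = BeautifulSoup(html, "lxml")  # use lxml as the underlying parser
for tag in soup.find_all(["a", "span"]):
    print(tag.name, tag.get_text())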

John T
  • Beautiful Soup is too cumbersome, and doesn't parse as well or as fast as lxml. Keep in mind that I need the crawler to be high-performing. Also, the link you provided is just about extracting *href* links from the HTML source. – user3623152 Jun 02 '14 at 16:07
  • My apologies, your question was updated somewhat while I was posting. I've never had performance issues with BS, but I've generally used it for offline, after-the-fact data crunching. – John T Jun 02 '14 at 16:11
  • Look at lxml: it is fast, really fast, and handles broken HTML well. If I point the crawler at maybe 5 targets to spider, then BS would be a huge bottleneck. :) – user3623152 Jun 02 '14 at 16:13
  • Beautiful Soup takes a parser argument in its constructor; I tell Beautiful Soup to use lxml as the parser. – user137717 Mar 06 '16 at 20:20