I have been trying to figure out how to use BeautifulSoup, and a quick search showed that lxml can evaluate XPath expressions against an HTML page. I would LOVE to be able to do that, but the tutorial isn't that intuitive.

I know how to use Firebug to grab an XPath, and I was curious whether anyone has used lxml and can explain how to use it to extract specific XPaths and print the results, say 5 per line, or whether that's even possible.

Selenium is using Chrome and loads the page properly; I just need help moving forward.

Thanks!

Prasad
  • What is bs4? Wikipedia says it's some sedan :) – Himanshu Dec 20 '12 at 04:47
  • @Himanshu Sorry- bs4 = beautifulsoup4 – twitch after coffee Dec 20 '12 at 04:59
  • Okay. To use xpath on xml docs with python, see element tree http://docs.python.org/2/library/xml.etree.elementtree.html#xpath-support . You may not be able to parse all html docs right off the web as they may not be all valid xml docs. See http://stackoverflow.com/questions/285990/parse-html-via-xpath – Himanshu Dec 20 '12 at 05:31

2 Answers


lxml's ElementTree has an .xpath() method (note that the ElementTree in the xml package of the Python standard library doesn't have that!)

e.g.

# see http://lxml.de/xpathxslt.html

from lxml import etree

# root = etree.parse('/tmp/stack-overflow-questions.xml')
root = etree.XML('''
        <answers>
            <answer author="dlam" question-id="13965403">AAA</answer>
        </answers>
''')

all_answers = root.xpath('.//answer')

for i, answer in enumerate(all_answers):
    who_answered = answer.attrib['author']
    question_id = answer.attrib['question-id']
    answer_text = answer.text
    print('Answer #{0} by {1}: {2}'.format(i, who_answered, answer_text))
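
For comparison, here is a minimal sketch of the same lookup with the standard library's xml.etree.ElementTree. It has no .xpath() method, but findall() accepts a limited XPath subset, including attribute predicates. The second answer element and its author "someone-else" are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Same shape as the document above; the second <answer> is a hypothetical
# addition so that the attribute filter has something to exclude.
root = ET.fromstring('''
<answers>
    <answer author="dlam" question-id="13965403">AAA</answer>
    <answer author="someone-else" question-id="13965403">BBB</answer>
</answers>
''')

# findall() understands a limited XPath subset: paths and simple predicates.
all_answers = root.findall('.//answer')
by_dlam = root.findall(".//answer[@author='dlam']")

# Collect "author: text" for every match.
summary = ['{0}: {1}'.format(a.get('author'), a.text) for a in all_answers]
print(summary)
```

Anything beyond that subset (XPath functions, axes such as following-sibling, and so on) needs lxml's .xpath().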
David Lam

I prefer lxml because it is more efficient than Selenium when extracting a large number of elements. You can use Selenium to get the page source and then parse that source with lxml's XPath support instead of Selenium's native find_elements_by_xpath.
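
A rough sketch of that approach, assuming Python 3 and an already-loaded Selenium WebDriver named `driver` (not shown here). The helper names `extract_texts` and `rows_of` are hypothetical, as is the example XPath in the usage comment. The second helper covers the "5 per line" part of the question.

```python
def extract_texts(page_source, xpath_expr):
    """Parse an HTML string with lxml and return the text of each XPath match."""
    # lxml is a third-party package; it is imported here so that rows_of()
    # below can be used even where lxml is not installed.
    from lxml import html

    tree = html.fromstring(page_source)
    results = tree.xpath(xpath_expr)
    # An XPath can yield strings (e.g. '//a/text()') or elements (e.g. '//a').
    return [r.strip() if isinstance(r, str) else r.text_content().strip()
            for r in results]


def rows_of(items, per_line=5):
    """Join items into lines that hold at most per_line items each."""
    return [' '.join(items[i:i + per_line])
            for i in range(0, len(items), per_line)]


# Usage with Selenium (assumes `driver` has already loaded the page):
#     texts = extract_texts(driver.page_source, '//a/text()')
#     for line in rows_of(texts):
#         print(line)
```

An XPath copied from Firebug can be passed as `xpath_expr`, though paths copied from a live browser DOM sometimes include elements (such as tbody) that are not in the raw source, so they may need adjusting.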

stamaimer