
This is a question on web scraping. I am able to scrape sites using BeautifulSoup, but I want to use XPaths because of Chrome's "Copy XPath" feature, which makes it super easy. My understanding is that XPath is easier because with BeautifulSoup I have to work out the tag and attribute for each element manually.

For example, below is how I get the job titles, but I have to construct the 'find' arguments manually. If it were XPath, my understanding is that I could just do "Copy XPath" from Chrome's 'Inspect Element' window.

import requests
from bs4 import BeautifulSoup

url = "http://www.indeed.com/jobs?q=hardware+engineer&l=San+Francisco%2C+CA"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

job_titles = soup.find_all("h2", {"class": "jobtitle"})
jobs_sponsored = soup.find_all("div", {"data-tn-component": "sponsoredJob"})

for title in job_titles:
    print(title.text.strip())

print("\nSPONSORED JOB LISTINGS\n")

for sponsored in jobs_sponsored:
    print(sponsored.text.strip())

What would the equivalent code using XPaths look like? I am not able to find the library/syntax for extracting content using XPath instead of HTML ids.

EDIT: The question is NOT whether I can use XPath with BeautifulSoup (I already know I cannot). The question is: what would some or all of the statements above look like if I wanted to use XPath? What package (I don't have to use BeautifulSoup) do I need to use?

user1406716

1 Answer


As you've already mentioned, BeautifulSoup does not offer XPath functionality, but it does have CSS selectors built in. Support is limited, but it is usually enough for the most common use cases. This is how to apply them here:

soup.select("h2.jobtitle")
soup.select("div[data-tn-component=sponsoredJob]")

Note that the "Copy XPath" built into Chrome functionality would produce an absolute XPath expression - an absolute path to the element starting from the root html element (or with the first parent having the id attribute). Which, in general, is quite fragile - the relative positions of the elements and all the parents of the desired element(s) would all make the locator easily breakable - in this case you'd be very much design and layout dependent which you should always try to avoid. Do not simply trust the locator Chrome auto-derived for you - see if you can make it better.

If you need a Python HTML Parser with XPath support built-in, look into lxml.html.
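
If you go that route, here is a rough sketch of the same extraction with lxml.html; the XPath expressions are hand-written, relative ones mirroring the classes from your snippet (assumptions about the page markup, not something copied from Chrome):

import requests
from lxml import html

url = "http://www.indeed.com/jobs?q=hardware+engineer&l=San+Francisco%2C+CA"
r = requests.get(url)
tree = html.fromstring(r.content)

# relative, attribute-based XPath instead of the absolute path Chrome would copy
job_titles = tree.xpath('//h2[@class="jobtitle"]')
jobs_sponsored = tree.xpath('//div[@data-tn-component="sponsoredJob"]')

for title in job_titles:
    print(title.text_content().strip())

for sponsored in jobs_sponsored:
    print(sponsored.text_content().strip())

Note that @class="jobtitle" matches the class attribute exactly, unlike the CSS selector above, which also matches elements carrying additional classes.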

alecxe
  • I know. Maybe I need to edit the language of the question, but what package/syntax can I use to get the same data using XPath? I don't have to use BeautifulSoup. – user1406716 Jan 04 '16 at 03:18
  • @user1406716 ok, I think you can check the answer now. Note that if you are looking for tools and packages, this would be off-topic on SO. – alecxe Jan 04 '16 at 03:24
  • I believe Chrome's functionality will give you the absolute XPath from the nearest parent element with an `id` defined. But yeah +1, it's extremely fragile. – roippi Jan 04 '16 at 03:46
  • @roippi I think so too, thanks, made a note. – alecxe Jan 04 '16 at 04:33