
I'm trying to grab the business names from a Google local search results page such as this:

[screenshot of the results page]

Given the following:

[screenshot of the page markup]

... I would have thought that the XPath //div[@class ="_rl"] or //*[@class ="_rl"] would suffice, but they each return nothing. I know I need to make the query more explicit/precise, but how exactly?

I'm using Python and lxml, if that is of relevance.

Pyderman

3 Answers


You mention Python, but based on your screenshot it seems you may just want to get the XPath from the browser?

In Chrome Developer Tools, you can right click on the element and select "Copy XPath."

[screenshot: Chrome Copy XPath]

duffn
  • Thanks, but I'm doing this programmatically. The browser is proving useful for seeing page structure, but only as a basis from which to form the XPaths and plug them into Python & lxml. Firebug & Web Developer add-ons are giving super-long XPaths that go right back up to the root node, and I'm looking for something more concise. – Pyderman Oct 22 '15 at 23:07

You're capturing the element that encloses the text, not the text itself. You need to either read the text attribute of each returned element, or extend your XPath so that it selects the text directly:

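# `tree` is assumed to be the document already parsed with lxml.

# Option 1: select the elements, then read each element's .text
list_of_elements = tree.xpath('//div[@class="_rl"]')
for element in list_of_elements:
    print(element.text)

# Option 2: select the text nodes directly in the XPath
list_of_text = tree.xpath('//div[@class="_rl"]/text()')
for text in list_of_text:
    print(text)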
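
The two approaches are not always interchangeable once an element has nested markup. Here is a minimal sketch on a made-up snippet (the `SAMPLE` markup is hypothetical and only stands in for the real page), which also shows lxml's `text_content()` as a third option:

```python
from lxml import html

# Hypothetical markup standing in for the real results page.
SAMPLE = """
<html><body>
  <div class="_rl">TAI Chiropractic</div>
  <div class="_rl">Lamb <b>Chiropractic</b></div>
</body></html>
"""

tree = html.fromstring(SAMPLE)
elements = tree.xpath('//div[@class="_rl"]')

# .text holds only the text that comes before the first child element.
via_text_attr = [el.text for el in elements]

# /text() selects only the text nodes that are direct children of the div.
via_text_nodes = tree.xpath('//div[@class="_rl"]/text()')

# text_content() joins all text under the element, nested tags included.
via_text_content = [el.text_content() for el in elements]

print(via_text_attr)
print(via_text_nodes)
print(via_text_content)
```

If the business names on the real page are wrapped in further tags, `text_content()` (or `.//text()`) is probably the safer choice.
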
tlastowka
  • Hmm, yes, I had tried both approaches to getting the text; an empty list was returned each time. See here: http://pastebin.com/v3Y2NPQc – Pyderman Oct 22 '15 at 23:40
  • urllib2 returns a file-like object as the base object, so you're not actually running your XPath against the text. I usually use requests, but I think with urllib2, getting the text is response.read(). – tlastowka Oct 23 '15 at 00:13
  • I'm following the example given by @MartijnPieters, where he uses the response directly from `urllib2.urlopen()`: http://stackoverflow.com/a/11466033/1389110 – Pyderman Oct 26 '15 at 23:59
  • Have you tried the modified code? There's no need for a complex XPath. If you run into any problem, let me know. – Learner Oct 28 '15 at 10:49
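
On the question raised in these comments of whether to pass the response object or its contents: as far as lxml is concerned, both routes should work. A small sketch, where `io.BytesIO` is only a stand-in for the urllib2 response (any object with a `read()` method) and `RAW` is made-up markup:

```python
import io
from lxml import html

# Hypothetical markup; the real page will look different.
RAW = b'<html><body><div class="_rl">TAI Chiropractic</div></body></html>'
QUERY = '//div[@class="_rl"]/text()'

# Stand-in for the object returned by urllib2.urlopen().
fake_response = io.BytesIO(RAW)

# Route 1: give the file-like object to lxml.html.parse().
names_from_file = html.parse(fake_response).xpath(QUERY)

# Route 2: read the bytes first, then parse them with fromstring().
names_from_bytes = html.fromstring(RAW).xpath(QUERY)

print(names_from_file)
print(names_from_bytes)
```

If both routes return an empty list on the real response, how the response is handed to the parser is probably not the cause, and the fetched markup itself is worth inspecting.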

Below is the working code:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait  # available since 2.4.0
from selenium.webdriver.support import expected_conditions as EC  # available since 2.26.0
from selenium.webdriver.common.by import By
import lxml.html  # the cssselect() call below also needs the cssselect package
from bs4 import BeautifulSoup


# Load the page in a real browser and wait until the body element is present.
driver = webdriver.Chrome()
driver.get("https://www.google.com/ncr#q=chiropractors%2BNew+York,+NY&rflfq=1&rlha=0&tbm=lcl")
WebDriverWait(driver, 1000).until(EC.presence_of_all_elements_located((By.TAG_NAME, "body")))

print('Using pure python-----------' * 2)
# Selenium only: the first line of each result block is the business name.
results = driver.find_elements(By.XPATH, "//div[@class='_pl _ki']")
for result in results:
    print(result.text.split("\n")[0])

print('Using bs4-----------------' * 2)
soup = BeautifulSoup(driver.page_source, 'html.parser')
for div in soup.find_all('div', class_='_rl'):
    print(div.text)

print('Using lxml---------------' * 2)
tree = lxml.html.fromstring(driver.page_source)
for div in tree.cssselect("._rl"):
    # Join every text node under the element, including nested ones.
    print(''.join(div.xpath('.//text()')))

driver.close()

It prints:

Using pure python-----------Using pure python-----------
TAI Chiropractic
Body in Balance Chiropractic
Lamb Chiropractic
Esprit Wellness
Jamie H Bassel DC PC
Madison Avenue Chiropractic Center
Howard Benedikt DC
44'Th Street Chiropractic
Rockefeller Health & Medical Chiropractic
Frank J. Valente, DC, PC
Dr. Robert Shire
5th Avenue Chiropractic
Peterson Chiropractic
NYC Chiropractic Solutions
20 East Chiropractic of Midtown
GRAND CENTRAL CHIROPRACTIC WELLNESS CENTER
Park Avenue Chiropractic Center - Dr Nancy Jacobs
Murray Hill Chiropractic PC
Empire Sports & Spine
JW Chiropractic
Using bs4-----------------Using bs4-----------------
TAI Chiropractic
Body in Balance Chiropractic
Lamb Chiropractic
Esprit Wellness
Jamie H Bassel DC PC
Madison Avenue Chiropractic Center
Howard Benedikt DC
44'Th Street Chiropractic
Rockefeller Health & Medical Chiropractic
Frank J. Valente, DC, PC
Dr. Robert Shire
5th Avenue Chiropractic
Peterson Chiropractic
NYC Chiropractic Solutions
20 East Chiropractic of Midtown
GRAND CENTRAL CHIROPRACTIC WELLNESS CENTER
Park Avenue Chiropractic Center - Dr Nancy Jacobs
Murray Hill Chiropractic PC
Empire Sports & Spine
JW Chiropractic
Using lxml---------------Using lxml---------------
TAI Chiropractic
Body in Balance Chiropractic
Lamb Chiropractic
Esprit Wellness
Jamie H Bassel DC PC
Madison Avenue Chiropractic Center
Howard Benedikt DC
44'Th Street Chiropractic
Rockefeller Health & Medical Chiropractic
Frank J. Valente, DC, PC
Dr. Robert Shire
5th Avenue Chiropractic
Peterson Chiropractic
NYC Chiropractic Solutions
20 East Chiropractic of Midtown
GRAND CENTRAL CHIROPRACTIC WELLNESS CENTER
Park Avenue Chiropractic Center - Dr Nancy Jacobs
Murray Hill Chiropractic PC
Empire Sports & Spine
JW Chiropractic
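
One detail about the selectors above: the XPath `//div[@class='_pl _ki']` compares the whole class attribute as a string, while the bs4 and cssselect lookups match `_rl` as a single class token. If you want token-style matching in plain XPath 1.0, the usual idiom is `contains(concat(...))`. A sketch on made-up markup (the `_extra` and `_rlx` class names are hypothetical):

```python
from lxml import html

# Hypothetical markup; only the _rl class name comes from the question.
SAMPLE = """
<html><body>
  <div class="_rl">Exact class</div>
  <div class="_rl _extra">Extra class</div>
  <div class="_rlx">Different class</div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# Matches only when the attribute value is exactly "_rl".
exact = tree.xpath('//div[@class="_rl"]/text()')

# Matches when "_rl" is one of the space-separated class tokens.
token = tree.xpath(
    '//div[contains(concat(" ", normalize-space(@class), " "), " _rl ")]/text()'
)

print(exact)
print(token)
```

The padding with spaces is what keeps `_rlx` from matching, which a bare `contains(@class, "_rl")` would not do.
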
Learner
  • Would you mind pasting your full code at pastebin.com? I'm struggling to get any of the XPaths above (including yours) to work on the response from urllib2. – Pyderman Oct 27 '15 at 00:01
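
When the XPaths keep coming back empty on the urllib2 response, one thing worth ruling out is that the fetched markup does not contain the `_rl` class at all. This is an assumption, as nothing in this thread confirms it, but it would be consistent with the Selenium version working. A minimal standard-library sketch (Python 3; `ClassCollector`, `class_names` and both sample strings are hypothetical) that lists the class names present in a piece of markup:

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every class token seen on any start tag."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def class_names(markup):
    """Return the set of class tokens that occur in the markup string."""
    collector = ClassCollector()
    collector.feed(markup)
    collector.close()
    return collector.classes


# Made-up examples: what a browser might show vs. what a plain fetch might return.
rendered = '<div class="_pl _ki"><div class="_rl">TAI Chiropractic</div></div>'
fetched = '<div id="main"><script>/* results filled in later */</script></div>'

print("_rl" in class_names(rendered))
print("_rl" in class_names(fetched))
```

Running `class_names()` on the text actually returned by the request tells you whether any XPath on `_rl` could match in the first place.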