Each entry in the "7-pack" search results here shows an address and a phone number on the right-hand side, like this:

[image: screenshot of the search results]

For each, I want to extract (i) the address and (ii) the phone number. The problem is, here is how these elements are defined in HTML:

<div style="width:146px;float:left;color:#808080;line-height:18px"><span>Houston, TX</span><br><span>United States</span><br><nobr><span>(713) 766-6663</span></nobr></div>

So there is no class name, CSS selector, or id that I can pass to a find_element_by_*() method. I won't know the link text, so I can't use find_element_by_partial_link_text(), and as far as I am aware WebDriver does not provide a method for finding by style. How do we work around this? I need to extract the right data reliably every time, for each search result, across varying queries.

Language binding to WebDriver is Python.

Pyderman
  • I could use find_element_by_xpath("id('lclbox')/div/div[i]/div/div[2]/div[3]").text and iterate over i, but this feels rather unwieldy, not to mention brittle. – Pyderman Jun 26 '15 at 14:31

1 Answer

There are at least two key things you can rely on: the container box with id="lclbox" and elements with class="intrlu" corresponding to each result item.

There is more than one way to extract the address and phone number from each result item. Here is one option (admittedly not a pretty one) that locates the phone number by checking the text of each span element against a regex:

import re

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver


driver = webdriver.Chrome()
driver.get('https://www.google.com/?gws_rd=ssl#q=plumbers%2Bhouston%2Btx')

# waiting for results to load
wait = WebDriverWait(driver, 10)
box = wait.until(EC.visibility_of_element_located((By.ID, "lclbox")))

phone_re = re.compile(r"\(\d{3}\) \d{3}-\d{4}")

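for result in box.find_elements(By.CLASS_NAME, "intrlu"):
    for span in result.find_elements(By.TAG_NAME, "span"):
        # the first span whose text looks like a phone number marks the block we want
        if phone_re.search(span.text):
            # span -> nobr -> div: the div holds both the address and the phone
            parent = span.find_element(By.XPATH, "../..")
            print(parent.text)
            break
    print("-----")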

I'm pretty sure it can be improved, but I hope it gives you a starting point. It prints:

Houston, TX
(713) 812-7070
-----
Houston, TX
(713) 472-5554
-----
6646 Satsuma Dr
Houston, TX
(713) 896-9700
-----
1420 N Durham Dr
Houston, TX
(713) 868-9907
-----
5630 Edgemoor Dr
Houston, TX
(713) 665-5890
-----
5403 Kirby Dr
Houston, TX
(713) 224-3747
-----
Houston, TX
(713) 385-0349
-----
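
Since the question asks for the address and the phone number as separate values, here is a minimal sketch of how the text of each block could be split afterwards. It runs on plain strings, so it does not need a browser. The helper name split_address_phone is hypothetical, and it assumes the block text has one item per line with a US-style phone number on one of those lines, as in the output above.

```python
import re

# Same US-style pattern as in the answer above, e.g. "(713) 766-6663".
PHONE_RE = re.compile(r"\(\d{3}\) \d{3}-\d{4}")


def split_address_phone(block_text):
    """Split the text of one result block into (address, phone).

    Hypothetical helper: block_text is assumed to look like parent.text
    in the loop above, i.e. one item per line.
    """
    address_lines = []
    phone = None
    for line in block_text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = PHONE_RE.search(line)
        if match and phone is None:
            # first line that contains a phone number
            phone = match.group(0)
        else:
            # everything else is treated as part of the address
            address_lines.append(line)
    return ", ".join(address_lines), phone


sample = "6646 Satsuma Dr\nHouston, TX\n(713) 896-9700"
address, phone = split_address_phone(sample)
print(address)  # 6646 Satsuma Dr, Houston, TX
print(phone)    # (713) 896-9700
```

In the loop above you could call split_address_phone(parent.text) instead of printing parent.text directly. If a block has no phone number, the second value is None, which makes missing data easy to spot.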
alecxe