How to extract the headers of the individual search items using Selenium and Python

Question

I am learning Python and trying to print the search results from python.org. I'm using Selenium.

Steps I want to do:

open python.org
Search for the term "arrays" (displays the results)
print the list of search items (print("searchResults"))

My Code:

from selenium import webdriver
import time

driver = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")

#waiting to find the element before throwing error no element found
driver.implicitly_wait(10)
#driver.maximize_window()

#getting the website
driver.get("https://www.python.org/")
driver.implicitly_wait(5)
#finding element by id
driver.find_element_by_id("id-search-field").send_keys("arrays")
driver.find_element_by_id("submit").click()
print("Test Successful")

SearchResults = driver.find_element_by_xpath("/html/body/div[1]/div[3]/div/section/form/ul")
print(SearchResults.text)

-> This prints all results.

Now I want the individual result items and their headers. When I inspect the search results on the site, I get this: <a href="/dev/peps/pep-0209/">PEP 209 -- Multi-dimensional Arrays</a>

There is no Tag, no Class and no Name to use.

How do I use this to just get all the headers?

score 1 · Answer 1 · answered Jun 02 '20 at 23:43

Your SearchResults lookup uses a static XPath that points at the parent tag containing the list of results you want, the <ul>:

SearchResults = driver.find_element_by_xpath("/html/body/div[1]/div[3]/div/section/form/ul")

If you inspect the search results page, you will see that inside this <ul> tag there are several <li> elements, each containing an <h3> with an <a> that holds the row's headline. From what you asked, I reckon those are the elements you are trying to capture, so you could try:

SearchResults = driver.find_elements_by_xpath("/html/body/div[1]/div[3]/div/section/form/ul/li[*]/h3/a")

or even something like:

SearchResults = driver.find_element_by_xpath("/html/body/div[1]/div[3]/div/section/form/ul")
ChildResults = SearchResults.find_elements_by_xpath('./li/h3/a')  # './/*' would return every descendant, not only the headline links

I have not actually tested the code, but the idea should work, at least for a first trial with Selenium. My main point is this: you are trying to read a list of elements by looking only at their parent element; you should go one step further and look at the child elements.

That said, I recommend searching online for Selenium best practices on XPaths and locating elements; these long, static XPaths can become a nightmare in the long run. The more flexible your element identifiers are, the easier it will be to maintain your code and keep it robust.
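The parent-then-children idea can be sketched with a short relative XPath instead of the absolute one. This is only an illustration under a few assumptions: the function name headline_texts is made up, the results <ul> is assumed to carry the class list-recent-events on the page, and the locator strategy is passed as the plain string "xpath" (the value behind By.XPATH) through the generic find_element(by, value) form.

```python
def headline_texts(driver):
    """Return the text of every result headline link on the current page."""
    # "xpath" is the value of selenium.webdriver.common.by.By.XPATH; using the
    # plain string keeps this sketch free of a Selenium import.
    # Assumption: the results <ul> has the class "list-recent-events".
    results_list = driver.find_element(
        "xpath", "//ul[contains(@class, 'list-recent-events')]"
    )
    # Relative XPath: start at the <ul> and walk li -> h3 -> a.
    links = results_list.find_elements("xpath", "./li/h3/a")
    return [link.text for link in links]
```

Call it after the search results page has loaded, for example print(headline_texts(driver)). Anchoring on a class rather than on div positions should survive small layout changes better.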

score 1 · Answer 2 · answered Jun 02 '20 at 23:45

Can you give this a try? Instead of using XPath, try using a CSS selector and break down each element:

from selenium import webdriver
import json

driver = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")

# Getting the website
driver.get("https://www.python.org/")
# Finding element by id
driver.find_element_by_id("id-search-field").send_keys("arrays")
driver.find_element_by_id("submit").click()
print("Test Successful")
for elem in driver.find_elements_by_css_selector("section.main-content ul li"):
    elem_data = {
        'title': elem.find_element_by_css_selector("h3").text,
        'content': elem.find_element_by_css_selector("p").text,
        'link': elem.find_element_by_css_selector("h3 a").get_attribute('href'),
    }
    print(json.dumps(elem_data, indent=4))
    break
# {
#     "title": "PEP 209 -- Multi-dimensional Arrays",
#     "content": "...arrays comprised of simple types, like numeric. How are masked-arrays implemented? Masked-arrays in Numeric 1 are implemented as a separate array class. With the ability to add new array types to Numeric 2, it is possible that masked-arrays in Numeric 2 could be implemented as a new array type instead of an array class. How are numerical errors handled (IEEE floating-point errors in particular)? It is not clear to the proposers (Paul Barrett and Travis Oliphant) what is the best or preferre...",
#     "link": "https://www.python.org/dev/peps/pep-0209/"
# }
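On newer Selenium releases (4.3 and later, as far as I know) the find_element_by_* helpers no longer exist, so the loop above would need the generic find_elements(by, value) form. A minimal sketch of the same extraction, assuming the same selectors still match the page; collect_results is a hypothetical helper name, and "css selector" is the string value behind By.CSS_SELECTOR:

```python
CSS = "css selector"  # same value as selenium.webdriver.common.by.By.CSS_SELECTOR


def collect_results(driver):
    """Build one dict per search result <li> on the current page."""
    items = []
    for li in driver.find_elements(CSS, "section.main-content ul li"):
        items.append({
            # Headline text from the <h3>, summary from the <p>, URL from the link.
            "title": li.find_element(CSS, "h3").text,
            "content": li.find_element(CSS, "p").text,
            "link": li.find_element(CSS, "h3 a").get_attribute("href"),
        })
    return items
```

With a live driver on the results page, print(json.dumps(collect_results(driver)[0], indent=4)) should produce the same kind of output as shown above.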

Ryan o · Answer 3 · 2020-06-03T05:20:57.810

You could use Selenium's selector methods if you wanted.

Personally, I like to inject JavaScript and return the results. For this example I would do this:

Have a JavaScript file containing the following:

return (() => {
   // Collect [title, text, href] for every result row.
   var parsed_results = [];
   var search_results = document.getElementsByClassName('list-recent-events')[0].children;
   for (var i = 0; i < search_results.length; i++) {
      var result = search_results[i];
      var text = result.innerText;
      var title = result.getElementsByTagName('a')[0].innerText;
      var href = 'https://www.python.org' + result.getElementsByTagName('a')[0].getAttribute('href');
      parsed_results.push([title, text, href]);
   }
   return parsed_results;
})();

You can use it like this, after the page has loaded:

search_results = driver.execute_script(open('path/to/file.js').read())

Then you can just go through them like you normally would in Python. Each row comes back in the order the script pushed it: [title, text, href].

for r in search_results:
    title = r[0]
    text = r[1]
    href = r[2]
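The same idea also works with the script kept inline and the rows turned into dicts, so the field order cannot get mixed up later. This is a sketch under assumptions: EXTRACT_JS and fetch_results are names invented here, the results list is assumed to be the first element with class list-recent-events, and the anchor's href property is used because the browser already resolves it to an absolute URL.

```python
# Script body passed to execute_script; a top-level `return` is allowed
# because Selenium runs the text as the body of a function.
EXTRACT_JS = """
const rows = document.getElementsByClassName('list-recent-events')[0].children;
return Array.from(rows).map(row => {
  const link = row.getElementsByTagName('a')[0];
  return [link.innerText, row.innerText, link.href];
});
"""


def fetch_results(driver):
    """Run the script and label each [title, text, href] row."""
    rows = driver.execute_script(EXTRACT_JS)
    return [
        {"title": title, "text": text, "href": href}
        for title, text, href in rows
    ]
```

Usage would look like: for item in fetch_results(driver): print(item["title"], item["href"]).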

score 0 · Answer 4 · answered Jun 03 '20 at 09:20

To print the headers of all the individual search results using Selenium and Python, you need to use WebDriverWait with visibility_of_all_elements_located(), and you can use either of the following locator strategies:

Using CSS_SELECTOR:

print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "ul.list-recent-events.menu li>h3>a")))])

Using XPATH:

print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//ul[@class='list-recent-events menu']//li/h3/a")))])

Console Output:

['PEP 209 -- Multi-dimensional Arrays', 'PEP 207 -- Rich Comparisons', 'PEP 335 -- Overloadable Boolean Operators', 'PEP 535 -- Rich comparison chaining', 'Python Success Stories', 'PEP 574 -- Pickle protocol 5 with out-of-band data', 'Parade of the PEPs', 'PEP 3118 -- Revising the buffer protocol', 'PEP 465 -- A dedicated infix operator for matrix multiplication', 'PEP 358 -- The "bytes" Object', 'PEP 225 -- Elementwise/Objectwise Operators', 'Highlights: Python 2.4', 'PEP 211 -- Adding A New Outer Product Operator', 'EDU-SIG: Python in Education', 'PEP 204 -- Range Literals', 'PEP 455 -- Adding a key-transforming dictionary to collections', 'PEP 252 -- Making Types Look More Like Classes', 'PEP 586 -- Literal Types', 'PEP 579 -- Refactoring C functions and methods', 'PEP 3116 -- New I/O']

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
