I am trying to scrape external data to pre-fill form data on a website. The aim is to find a keyword and return the class name of the element that contains it. The constraints are that I do not know in advance whether the page contains the keyword, or what type of tag the keyword sits in.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

chromeDriverPath = "./chromedriver"
chrome_options = webdriver.ChromeOptions()

driver = webdriver.Chrome(chromeDriverPath, options=chrome_options)
driver.get("https://www.scrapethissite.com/pages/")

#keywords to scrape for
listOfKeywords = ['ajax', 'click']
for keyword in listOfKeywords:
    try:
        foundKeyword = driver.find_element(By.XPATH, "//*[contains(text(), " + keyword + ")]")
        
        print(foundKeyword.get_attribute("class"))

    except:
        pass

driver.close()

This example returns the topmost ancestor, not the immediate parent. To elaborate, it prints "" because it returns the class attribute of the <html> tag, which does not have one. Similarly, if I change the code to search for the keyword in a <div>:

foundKeyword = driver.find_element(By.XPATH, "//div[contains(text(), " + keyword + ")]")

This prints "container" for both 'ajax' and 'click', because the <div class='container'> wraps everything on the page.

So for the keyword 'ajax', the answer I want is 'page-title' (the class of the immediate parent tag). Similarly, for 'click', I would expect it to print 'lead session-desc'.

The image below may help to visualize this:

[Image: Scrape Example]

Nick
  • What is the website, and are you sure this is the optimal approach? Is the keyword search an attempt at future-proofing? – QHarr Nov 11 '21 at 04:04
  • Neither of these returns anything on https://www.scrapethissite.com/pages/: //div[contains(text(), "ajax")] and //div[contains(text(), "click")]. Nor does your code print anything when executed. What is the correct locator for which you need the immediate parent's class attribute? To make it simpler, can you elaborate on what exactly you are trying to scrape from the website? – QualityMatters Nov 11 '21 at 05:37
  • @Nick - What is the expected output? I mean, which class name should it print? – pmadhu Nov 11 '21 at 05:37
  • @pmadhu - I have added an image to help visualize. For the keyword AJAX I would like to return the class of the tag that wraps it, shown in red in the image. Similarly, for the keyword 'click', I would like the class of the tag that wraps it. – Nick Nov 11 '21 at 06:09
  • @QualityMatters - I am unsure why the code isn't working for you; I believe my code is consistent with others, see https://stackoverflow.com/a/18701085/15264946 . I have added an image to help visualize. I am trying to find AJAX (the red item) and return the class seen in the DOM, 'page-title'; similarly, I am trying to find 'click' and return 'lead session-desc'. Although it does not work for you, my code returns 'container', which is an ancestor of both elements. – Nick Nov 11 '21 at 06:12
  • A possible solution is to iterate through every tag in the website and search for the keywords within that tag. – Nick Nov 11 '21 at 06:14

2 Answers

As per the comments, to get the parent element of a WebElement you can use the parent axis in the XPath.

The <p> element directly contains its text. The parent tag of that element is <div class='page'>.

Try the following:

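from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

driver.get("https://www.scrapethissite.com/pages/")

listOfKeywords = ['AJAX', 'Click']

for keyword in listOfKeywords:
    try:
        # First element in document order whose first text node contains the keyword
        element = driver.find_element(By.XPATH, "//*[contains(text(),'{}')]".format(keyword))
        # Class of that element's immediate parent
        parent = element.find_element(By.XPATH, "./parent::*").get_attribute("class")
        tag_class = element.get_attribute("class")
        print(f"{keyword} : Parent tag class - {parent}, tag class-name - {tag_class}")
    except NoSuchElementException:
        print("Keyword not found")

Output:
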
AJAX : Parent tag class - page-title, tag class-name - 
Click : Parent tag class - page, tag class-name - lead session-desc
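
Two details are worth handling explicitly. In the question's snippet the keyword is concatenated into the XPath without quotes, so XPath reads `ajax` as a child element name rather than a string; that node-set is empty, converts to "", and `contains(..., "")` is true for every element, which is why the first element in document order (<html>, or the first <div>) comes back. The question's keywords are also lowercase while the page text is capitalised. Below is a minimal sketch of a helper that quotes the keyword and matches ASCII case-insensitively. The names `xpath_literal` and `keyword_xpath` are hypothetical, not part of Selenium.

```python
import string


def xpath_literal(value: str) -> str:
    """Quote a Python string as an XPath 1.0 string literal."""
    if "'" not in value:
        return f"'{value}'"
    if '"' not in value:
        return f'"{value}"'
    # The value holds both quote types: join single-quoted pieces with concat().
    pieces = [f"'{part}'" for part in value.split("'")]
    return "concat(" + ", \"'\", ".join(pieces) + ")"


def keyword_xpath(keyword: str) -> str:
    """XPath for elements having a direct text node that contains `keyword`.

    The match ignores ASCII case by lowercasing the text with translate().
    """
    needle = xpath_literal(keyword.lower())
    upper, lower = string.ascii_uppercase, string.ascii_lowercase
    return f"//*[text()[contains(translate(., '{upper}', '{lower}'), {needle})]]"


if __name__ == "__main__":
    for kw in ("ajax", "click"):
        print(keyword_xpath(kw))
```

Assuming a live `driver`, the result can be passed to `driver.find_element(By.XPATH, keyword_xpath(keyword))` in place of the formatted string above. Because the predicate is on `text()` children, it only selects elements that hold the keyword in their own text, not every ancestor up to <html>.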
pmadhu

There are two distinct cases:

  • In the first case, you look for the keywords in the headings, whose text has a parent <h3> tag with class page-title.
  • In the second case, you look for the keywords in the <p> tags, which have a preceding sibling <h3> tag with class page-title.

For the first use case, to look for keywords like AJAX, you can use the following locator strategy:

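from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver.get("https://www.scrapethissite.com/pages/")
listOfKeywords = ['AJAX', 'Ajax']
for keyword in listOfKeywords:
    try:
        print(WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.XPATH, "//a[contains(., '{}')]//parent::h3[1]".format(keyword)))).get_attribute("class"))
    except TimeoutException:
        pass  # no visible match for this keyword within 5 seconds
driver.quit()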

For the second use case, to look for keywords like Click, you can use the following locator strategy:

driver.get("https://www.scrapethissite.com/pages/")
listOfKeywords = ['Click', 'click']
for keyword in listOfKeywords:
    try:
        print(WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.XPATH, "//p[contains(., '{}')]//preceding::h3[1]".format(keyword)))).get_attribute("class"))
    except:
        pass
driver.quit()

In both cases, the console output will be:

page-title

Update

Combining both use cases into a single one, you can use the following solution:

driver.get("https://www.scrapethissite.com/pages/")
listOfKeywords = ['AJAX', 'Click']
for keyword in listOfKeywords:
    try:
        print(WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.XPATH, "//*[contains(., '{}')]//parent::h3[1]".format(keyword)))).get_attribute("class"))
    except:
        pass
driver.quit()

Console Output:

page-title
page-title
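
When the tag type is unknown, the idea raised in the question's comments (walk every tag and check its own text) can also be tried offline on the page source. The following is a rough sketch using only Python's standard `html.parser`. The class `KeywordClassFinder`, the function `find_keyword_classes` and the `SAMPLE` markup are hypothetical; the sample is made up to resemble the structure described in the question and is not copied from the site.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class KeywordClassFinder(HTMLParser):
    """Records the class of the element that directly holds a keyword."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword.lower()
        self.stack = []    # (tag, class) for each currently open element
        self.matches = []  # (tag, own class, parent class) per hit

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs).get("class") or ""))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack and self.keyword in data.lower():
            tag, own_class = self.stack[-1]
            parent_class = self.stack[-2][1] if len(self.stack) > 1 else ""
            self.matches.append((tag, own_class, parent_class))


def find_keyword_classes(html, keyword):
    finder = KeywordClassFinder(keyword)
    finder.feed(html)
    finder.close()
    return finder.matches


# Made-up markup, only loosely modelled on the structure in the question.
SAMPLE = """
<div class="container">
  <div class="page">
    <h3 class="page-title"><a href="/pages/ajax-javascript/">AJAX &amp; Javascript</a></h3>
    <p class="lead session-desc">Click through a bunch of great films.</p>
  </div>
</div>
"""

if __name__ == "__main__":
    for kw in ("ajax", "click"):
        print(kw, find_keyword_classes(SAMPLE, kw))
```

With Selenium, `driver.page_source` could be passed in place of `SAMPLE`. The match is case-insensitive and reports both the element's own class and its parent's, so the caller can choose whichever is non-empty.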
undetected Selenium
  • Thanks, the main issue however as stated in the question is that I have the constraint of not knowing what type of tag the keyword is within. As such, I have marked pmadhu as the correct answer as it works solely on the keyword being found. – Nick Nov 11 '21 at 10:20
  • Great, glad that worked for you. Check out my updated answer and let me know how it looks. – undetected Selenium Nov 11 '21 at 10:34