1

I am trying to learn how to scrape news headlines with Python by following this post: https://medium.com/analytics-vidhya/how-to-scrape-news-headlines-from-reuters-27c0274dc13c

It worked perfectly. However, when I tried to adapt it to other news pages, I keep getting the "no such element" error. I realize this is because I am choosing the wrong class in the HTML, but I don't understand which class I should be choosing instead.

The script from that post was used on this news page: https://www.reuters.com/news/archive/technologynews?view=page&page=6&pageSize=10

I attempted to use it on the following pages, specifically looking into a local state agency:

https://www.startribune.com/search/?page=1&q=%22Department%20of%20Human%20Services%22&refresh=true

https://www.twincities.com/?s=%22Department+of+Human+Services%22&orderby=date&order=desc

Here is the code. The only changes are replacing the Reuters URL with the first of the pages above and replacing the class name used to select the button:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import dateutil.parser
import time
import csv
from datetime import datetime
import io
driver = webdriver.Chrome('chromedriver.exe')
driver.get('https://www.startribune.com/search/?page=1&q=%22Department%20of%20Human%20Services%22&refresh=true')

count = 0
headlines =[]
dates = []
for x in range(500):    
    try:
        # loadMoreButton.click()
        # time.sleep(3)
        loadMoreButton = driver.find_element_by_class_name("pagination-shortcut-link")
        # driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(3)
        loadMoreButton.click()
        time.sleep(2)
        news_headlines = driver.find_elements_by_class_name("story-title")
        news_dates = driver.find_elements_by_class_name("timestamp")
        for headline in news_headlines:
            headlines.append(headline.text)
            print(headline.text)
        for date in news_dates:
            dates.append(date.text)
            print(date.text)
            count=count+1
            print("CLICKED!!:")
    except Exception as e:
        print(e)
        break

To get the class name, I right-clicked the next button, selected Inspect Element, and copied what I saw. However, I continue to get the error, and I am not sure which other class I am meant to be using.

undetected Selenium
  • If one selector isn't working for you, maybe try using another (XPath, tag name, etc.) – emporerblk Sep 24 '20 at 14:04
  • I am pretty new to all of this, so I am not really sure what you mean by that. Should I read the documentation for the find_element_by_class_name function? Do you know of any pages that go into detail about what you mentioned? – Barayjeemqaf Sep 24 '20 at 14:10

2 Answers

1

Class names differ on every webpage you visit, because each site's developer chooses the name for a given WebElement. You should understand at least basic HTML before using Selenium. You will often, if not always, have to change your code when you change the page you scrape. I also suggest never relying on IDs when using Selenium, because developers can change them as needed; Google websites, for example, have an algorithm that does so, so your code may work today but not tomorrow, even on the same page. It is better to check for static text inside elements, or for something else that is unlikely to change soon. For example, if you need to click a button with "Next" written on it, collect all the buttons on the page, loop through them looking for the one whose text is "Next", and call click() on it.
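The text-matching idea above could be sketched roughly as follows. The helper names `first_with_text` and `click_by_text` are hypothetical, and whether the link on a given site actually reads "Next" is an assumption you would need to check in the browser's inspector. The matching logic is kept separate from Selenium so it can be tried without a browser.

```python
def first_with_text(elements, wanted):
    """Return the first element whose visible text equals `wanted`
    (ignoring case and surrounding whitespace), or None if none match."""
    target = wanted.strip().lower()
    for element in elements:
        if element.text.strip().lower() == target:
            return element
    return None


def click_by_text(driver, wanted, tags=("a", "button")):
    """Collect every <a> and <button> on the page and click the one whose
    text matches `wanted`. Returns True if something was clicked.

    Hypothetical helper: the tag list is an assumption, since a site may
    render its pagination control with a different element."""
    # Imported here so the pure matching helper above works without Selenium.
    from selenium.webdriver.common.by import By

    candidates = []
    for tag in tags:
        candidates.extend(driver.find_elements(By.TAG_NAME, tag))

    match = first_with_text(candidates, wanted)
    if match is not None:
        match.click()
    return match is not None
```

With a live driver this would be called as `click_by_text(driver, "Next")`. Matching on exact text is brittle in its own way (icons, arrows, or translated labels will not match), so treat it as one option alongside CSS and XPath locators rather than a replacement for them.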

Check my answer here: How to click on the Ask to join button within https://meet.google.com using Selenium and Python?

1

To click the link for the next page, you need to induce WebDriverWait for element_to_be_clickable(). You can use either of the following locator strategies:

  • Using CSS_SELECTOR:

    driver.get('https://www.startribune.com/search/?page=1&q=%22Department%20of%20Human%20Services%22&refresh=true')
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "li.pagination-list-item.is-selected +li > a"))).click()
    
  • Using XPATH:

    driver.get('https://www.startribune.com/search/?page=1&q=%22Department%20of%20Human%20Services%22&refresh=true')
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//li[@class='pagination-list-item is-selected']//following::li[1]/a"))).click()
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    

undetected Selenium
  • Thank you for the reply! Just for clarification, your suggestions will also require further revision to the loadMoreButton variable later on? I tried merely adding the WebDriverWait line initially; it loads the 2nd page but the script stops there without having read any of the headlines from the first page. I've also tried changing the class element selected within the loadMoreButton with similar results. – Barayjeemqaf Sep 25 '20 at 14:46
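
One possible way to combine the wait-and-click line with the scraping loop is to read the current page first and only then move to the next one, so the first page is not skipped. The sketch below is not the answerer's code: `collect_pages` and `scrape_startribune` are hypothetical names, the `story-title` class comes from the question and the CSS selector from the answer above, and neither has been verified against the live site.

```python
def collect_pages(scrape_page, go_to_next, max_pages=500):
    """Scrape the current page, then advance, until there is no next page
    or `max_pages` pages have been read."""
    results = []
    for _ in range(max_pages):
        results.extend(scrape_page())   # read the page we are on first
        if not go_to_next():            # then try to move on
            break
    return results


def scrape_startribune(driver, max_pages=500):
    """Wire `collect_pages` to Selenium (assumed selectors, see lead-in)."""
    # Imported here so `collect_pages` can be exercised without Selenium.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    next_link = (By.CSS_SELECTOR, "li.pagination-list-item.is-selected + li > a")

    def scrape_page():
        # Class name taken from the question; it may not match the real markup.
        titles = driver.find_elements(By.CLASS_NAME, "story-title")
        return [title.text for title in titles]

    def go_to_next():
        try:
            WebDriverWait(driver, 20).until(
                EC.element_to_be_clickable(next_link)
            ).click()
            return True
        except TimeoutException:
            return False  # no clickable "next" link appeared: assume last page

    return collect_pages(scrape_page, go_to_next, max_pages)
```

Because the wait now runs inside the loop, the separate `loadMoreButton` lookup is no longer needed. If the results load asynchronously after the click, an additional wait for the headline elements may be required before reading them.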