I was trying to scrape Reuters for NLP analysis and most of it is working, but I cannot get the code to click the "load more" button to load more news articles. Below is the code I am currently using:

import csv
import time
import pprint
from datetime import datetime, timedelta
import requests
import nltk
nltk.download('vader_lexicon')
from urllib.request import urlopen
from bs4 import BeautifulSoup
from bs4.element import Tag

comp_name = 'Apple'
url = 'https://www.reuters.com/search/news?blob=' + comp_name + '&sortBy=date&dateRange=all'

res = requests.get(url)  # the URL has no format placeholder, so fetch it directly
soup = BeautifulSoup(res.text,"lxml")
for item in soup.find_all("h3",{"class":"search-result-title"}):
    link = item.find("a")
    article_addr = link["href"]          # relative path returned by the search page
    headline = link.get_text()
    article_link = 'https://www.reuters.com' + article_addr

    try:
        resp = requests.get(article_link)  # article_addr is relative, so fetch the absolute link
    except Exception:
        continue

    sauce = BeautifulSoup(resp.text,"lxml")
    dateTag = sauce.find("div",{"class":"ArticleHeader_date"})
    contentTag = sauce.find("div",{"class":"StandardArticleBody_body"})

    date = None
    title = None
    content = None

    if isinstance(dateTag, Tag):
        date = dateTag.get_text().partition('/')[0]
    if isinstance(contentTag, Tag):
        content = contentTag.get_text().strip()
        # get_text() strips the markup, so take the paragraphs from the
        # tag itself rather than re-parsing the stripped plain text
        sentences = contentTag.find_all("p")
    time.sleep(3)
    print(date, headline, article_link)

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import time

browser = webdriver.Safari()
browser.get('https://www.reuters.com/search/news?blob=' + comp_name + '&sortBy=date&dateRange=all')
try:
    element = WebDriverWait(browser, 3).until(EC.presence_of_element_located((By.ID,'Id_Of_Element')))
except TimeoutException: 
    print("Time out!") 

2 Answers


To click the element with the text LOAD MORE RESULTS you need to induce WebDriverWait for element_to_be_clickable(), and you can use the following Locator Strategy:

  • Code Block:

    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import TimeoutException
    
    options = webdriver.ChromeOptions() 
    options.add_argument("start-maximized")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
    
    comp_name = 'Apple'
    driver.get('https://www.reuters.com/search/news?blob=' + comp_name + '&sortBy=date&dateRange=all')
    while True:
        try:
            driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='search-result-more-txt']"))))
            WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='search-result-more-txt']"))).click()
            print("LOAD MORE RESULTS button clicked")
        except TimeoutException:
            print("No more LOAD MORE RESULTS button to be clicked")
            break
    driver.quit()
    
  • Console Output:

    LOAD MORE RESULTS button clicked
    LOAD MORE RESULTS button clicked
    LOAD MORE RESULTS button clicked
    .
    .
    No more LOAD MORE RESULTS button to be clicked
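
  • Verification (a hypothetical extra step, not part of the code above): before calling driver.quit(), you can count the result headlines now present in the DOM, assuming the same h3 search-result-title markup the question parses:

    results = driver.find_elements(By.XPATH, "//h3[@class='search-result-title']")
    print("%d results loaded" % len(results))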
    

undetected Selenium

To click on LOAD MORE RESULTS, induce WebDriverWait() with element_to_be_clickable().

Use a while loop and check counter < 11 to click the button 10 times.

I have tested this on Chrome since I don't have the Safari browser, but it should work there too (see the Safari note after the code).

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

comp_name="Apple"
browser = webdriver.Chrome()
browser.get('https://www.reuters.com/search/news?blob=' + comp_name + '&sortBy=date&dateRange=all')

# Accept the terms button (comment this out if the consent banner does not appear in your browser)
WebDriverWait(browser,10).until(EC.element_to_be_clickable((By.CSS_SELECTOR,"button#_evidon-banner-acceptbutton"))).click()
i = 1
while i < 11:
    try:
        element = WebDriverWait(browser, 10).until(EC.element_to_be_clickable(
            (By.XPATH, "//div[@class='search-result-more-txt' and text()='LOAD MORE RESULTS']")))
        element.location_once_scrolled_into_view  # scroll the button into view
        browser.execute_script("arguments[0].click();", element)
        print(i)
        i = i + 1
    except TimeoutException:
        print("Time out!")
        break  # button no longer present, stop retrying
KunduK
  • TimeoutException Traceback (most recent call last), raised at the "Accept the terms button" line, WebDriverWait(browser,10).until(EC.element_to_be_clickable((By.CSS_SELECTOR,"button#_evidon-banner-acceptbutton"))).click(), ending in raise TimeoutException(message, screen, stacktrace) ... TimeoutException: Message: – web_scraper123 Jan 06 '20 at 19:13
  • Thank you for your reply, however I keep getting the error message above. – web_scraper123 Jan 06 '20 at 19:14
  • I am not sure whether you get that pop-up on Safari or not; can you comment out that line and check? I get the pop-up when interacting with Chrome, hence I added it, but you can comment out that line of code. – KunduK Jan 06 '20 at 19:19
  • I'm also wondering if there is any advice on how to incorporate the results loaded after clicking "load more results" into my previous code? Since the URL doesn't change after more results load, the following two lines still fetch the same content: res = requests.get(url) and soup = BeautifulSoup(res.text,"lxml"). Could you please kindly suggest? (A sketch follows these comments.) – web_scraper123 Jan 07 '20 at 16:39
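
One way to wire the two halves together, sketched below (a hypothetical combination of the answers above, reusing the question's URL and CSS classes, not code either answerer posted): let Selenium click through every LOAD MORE RESULTS page, then parse browser.page_source with BeautifulSoup instead of calling requests.get again, because requests never sees what JavaScript loaded:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup

comp_name = 'Apple'
browser = webdriver.Safari()
browser.get('https://www.reuters.com/search/news?blob=' + comp_name + '&sortBy=date&dateRange=all')

# Click LOAD MORE RESULTS until the button stops appearing
while True:
    try:
        element = WebDriverWait(browser, 10).until(EC.element_to_be_clickable(
            (By.XPATH, "//div[@class='search-result-more-txt']")))
        browser.execute_script("arguments[0].click();", element)
    except TimeoutException:
        break

# The URL never changes, so parse the rendered DOM rather than re-fetching it
soup = BeautifulSoup(browser.page_source, "lxml")
for item in soup.find_all("h3", {"class": "search-result-title"}):
    link = item.find("a")
    print(link.get_text(), 'https://www.reuters.com' + link["href"])

browser.quit()

The key point is browser.page_source: it reflects the DOM after all the clicks, whereas requests.get(url) only ever returns the first page of results.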