
I am extracting news items from a category page of a website that has no load-more button; instead, links to news stories are loaded as I scroll down. I created a function that accepts a category page URL and a loading limit (the number of times I want to scroll down) as inputs and returns all the links to the news items shown on that page. It was running properly earlier, but recently I found that on each page the news links are in different classes, which makes it very difficult to retrieve them all together. Maybe I'm wrong, and I do believe there could be a simpler method.

Category page link = https://www.scmp.com/topics/currencies

This was my try!

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities


def get_article_links(url, limit_loading):

    options = webdriver.ChromeOptions()

    caps = DesiredCapabilities().CHROME  # note: caps is set but never passed to the driver
    caps["pageLoadStrategy"] = "normal"

    options.add_argument("--window-size=1920,1080")
    options.add_argument("--disable-extensions")
    options.add_argument("--disable-notifications")
    options.add_argument("--disable-popup-blocking")

    driver = webdriver.Chrome(executable_path=r"E:\chromedriver\chromedriver.exe", options=options)  # add your chromedriver path

    driver.get(url)
    last_height = driver.execute_script("return document.body.scrollHeight")

    loading = 0
    end_div = driver.find_element('class name', 'topic-content__load-more-anchor')
    while loading < limit_loading:
        loading += 1
        print(f'scrolling to page {loading}...')
        end_div.location_once_scrolled_into_view  # accessing this property scrolls the element into view
        time.sleep(2)

    article_links = []
    bsObj = BeautifulSoup(driver.page_source, 'html.parser')
    for i in bsObj.find('div', {'class': 'content-box'}).find('div', {'class': 'topic-article-container'}).find_all('h2', {'class': 'article__title'}):
        article_links.append(i.a['href'])

    return article_links

Assuming I want to scroll 3 times in this category page and get back all the links on those pages.

get_article_links('https://www.scmp.com/topics/currencies', 3)

But it is neither scrolling nor getting the links back, which is the problem I'm facing. Any help with this will be really appreciated. Thanks~~

Starlord22
  • I can't seem to find an element with the CSS class `topic-content__load-more-anchor`. Are you sure that's right? – Nick ODell Feb 16 '23 at 02:48
  • If you open the website in a browser and right-click -> Inspect on a location in the page, it will bring up an elements panel where you can inspect the HTML and CSS. Comparing the classes you are currently searching on against the website, it appears the site has changed its CSS class names; you will need to track down the new classes you are scraping on and update them. – Nolan Walker Feb 16 '23 at 03:16

1 Answer

This one took a while to solve!

Apparently there is no way to scroll to the bottom of the page without clicking on it first. It's also much easier to just send the END key. I got the idea of sending keys from here.

After scrolling, look for the div that contains all the articles; its class is 'css-1279nek ef7usvb8'. Then use a regex to match all href values, capturing only the page paths. Lastly, use set() to get rid of duplicates.

    # additional imports needed at the top of the script:
    # import re
    # from selenium.webdriver.common.action_chains import ActionChains
    # from selenium.webdriver.common.keys import Keys

    driver.get(url)
    elem = driver.find_element_by_xpath("//body")
    for i in range(limit_loading):
        action = ActionChains(driver)
        action.click().perform()  # click once so the page accepts keyboard input
        elem.send_keys(Keys.END)  # jump to the bottom to trigger lazy loading
        time.sleep(5)

    bsObj = BeautifulSoup(driver.page_source, 'html.parser')
    article_div = bsObj.find('div', {'class': 'css-1279nek ef7usvb8'})
    article_links = list(set(re.findall(r'href="([/A-Za-z0-9.-]+)', str(article_div))))
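
The captured hrefs are relative paths. If you need absolute URLs, urllib.parse.urljoin can prepend the site root; a minimal sketch, assuming https://www.scmp.com as the base (the base URL is my assumption, not part of the answer above):

    from urllib.parse import urljoin

    base = 'https://www.scmp.com'  # assumed site root; adjust if scraping a different site
    absolute_links = [urljoin(base, href) for href in article_links]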