I am extracting news items from a category page of a website that has no load-more button; instead, links to news stories appear as I scroll down. I wrote a function that takes the category page URL and a page limit (the number of times I want to scroll down) as inputs and returns all the links to the news items shown on that page. It was running properly earlier, but recently I found that on each page the news links sit in different classes, which makes it very difficult to retrieve them all together. Maybe I'm wrong, but I believe there should be a simpler, class-agnostic way of grabbing them all at once, roughly what I sketch below.
Category page link = https://www.scmp.com/topics/currencies
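To show the kind of class-agnostic selection I have in mind, here is an untested sketch that simply collects every link inside the article container instead of relying on the per-article h2 classes (the helper name links_from_html is mine, and the container class is copied from my code below; I have not verified it still matches the page):

from bs4 import BeautifulSoup

def links_from_html(page_source):
    # Grab every <a href> inside the article container and de-duplicate,
    # so the varying classes on the individual headlines should not matter.
    soup = BeautifulSoup(page_source, 'html.parser')
    container = soup.find('div', {'class': 'topic-article-container'})
    if container is None:
        return []
    links = []
    for a in container.find_all('a', href=True):
        if a['href'] not in links:
            links.append(a['href'])
    return links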
This was my try!
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

def get_article_links(url, limit_loading):
    options = webdriver.ChromeOptions()
    options.add_argument("--window-size=1920,1080")
    options.add_argument("--disable-extensions")
    options.add_argument("--disable-notifications")
    options.add_argument("--disable-popup-blocking")
    caps = DesiredCapabilities().CHROME
    caps["pageLoadStrategy"] = "normal"  # wait for full page loads (this is the default)
    driver = webdriver.Chrome(executable_path=r"E:\chromedriver\chromedriver.exe",  # add your chromedriver path
                              options=options,
                              desired_capabilities=caps)
    driver.get(url)

    last_height = driver.execute_script("return document.body.scrollHeight")  # height before scrolling (not used further yet)

    # Scroll the "load more" anchor into view limit_loading times so that
    # new batches of articles are appended to the page.
    end_div = driver.find_element(By.CLASS_NAME, 'topic-content__load-more-anchor')
    loading = 0
    while loading < limit_loading:
        loading += 1
        print(f'scrolling to page {loading}...')
        end_div.location_once_scrolled_into_view  # accessing this property scrolls the anchor into view
        time.sleep(2)  # give the newly loaded articles time to render

    # Parse the fully scrolled page and collect the link inside every headline.
    article_links = []
    bsObj = BeautifulSoup(driver.page_source, 'html.parser')
    container = bsObj.find('div', {'class': 'content-box'}).find('div', {'class': 'topic-article-container'})
    for i in container.find_all('h2', {'class': 'article__title'}):
        article_links.append(i.a['href'])

    driver.quit()
    return article_links
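(For completeness, this is the usual scroll-and-compare loop that the last_height variable in my code was meant for; I never got as far as wiring it in. Untested sketch, assuming a live driver and the same time import as above; the function name scroll_page is mine:)

def scroll_page(driver, limit_loading, pause=2):
    # Scroll to the bottom, wait for new items, and stop early once the
    # page height stops growing or the scroll limit is reached.
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(limit_loading):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height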
Assuming I want to scroll 3 times on this category page and get back all the links loaded along the way:
get_article_links('https://www.scmp.com/topics/currencies', 3)
But it neither scrolls nor gets me back the links, which is the problem I'm facing. Any help with this would be really appreciated. Thanks~~