
I am trying to scrape the following page with requests and BeautifulSoup/lxml:

https://www.reuters.com/search/news?blob=soybean&sortBy=date&dateRange=all

This is the kind of page that has a "load more results" button. I have found a few pages explaining how to handle this, but not within the frame of requests.

I understand that I should spend a few more hours researching this problem before resorting to asking here, to show proof that I've tried.

I've tried looking into the inspect pane, the network tab, etc., but I'm still a bit too fresh with requests to understand how to interact with JavaScript.

I don't need a full-blown script/solution as an answer, just some pointers on how to do this very typical task with requests, to save me a few precious hours of research.

Thanks in advance.

jim jarnac
  • Sorry guys, I had the wrong title before – jim jarnac Jan 27 '18 at 16:23
  • `requests` won't run JS; you'll need `selenium` for that – t.m.adam Jan 27 '18 at 16:29
  • If you open the network tab in your browser's developer tools and click "load more results", you'll see that the button triggers an AJAX request to the URL `https://www.reuters.com/assets/searchArticleLoadMoreJson?blob=soybean&bigOrSmall=big&articleWithBlog=true&sortBy=date&dateRange=all&numResultsToShow=10&pn=4&callback=addMoreNewsResults`, where pn=4 is the page number, and the response is in JSON format (see the sketch after these comments). It also triggers some ping event, probably to make sure that AJAX is running in a browser (to block automatic scrapers). So requests might work, but selenium or phantomjs is a better choice – mugiseyebrows Jan 27 '18 at 18:07
  • @mugiseyebrows, yeah, I saw that too, but it won't be easy to use the data it returns. It seems like the data in `addMoreNewsResults` is not JSON (no double quotes in the keys) but a JS dict. – t.m.adam Jan 27 '18 at 18:21
  • https://stackoverflow.com/a/26900181/2079189 – mugiseyebrows Jan 27 '18 at 18:37
  • @mugiseyebrows, yes, it's not impossible, but it would be easier to use `selenium`. – t.m.adam Jan 27 '18 at 18:44
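
(Expanding on the comment thread above: here is a rough, untested sketch of the requests-only route. The endpoint URL comes from the comment; the page-number parameter pn and the regex cleanup of the unquoted keys are assumptions about the response format, so expect to adjust them.)

import json
import re
import requests

# Endpoint spotted in the browser's network tab (see the comment above);
# pn is the page number and callback wraps the payload JSONP-style.
base = ("https://www.reuters.com/assets/searchArticleLoadMoreJson"
        "?blob=soybean&bigOrSmall=big&articleWithBlog=true"
        "&sortBy=date&dateRange=all&numResultsToShow=10"
        "&pn={page}&callback=addMoreNewsResults")

# Some sites reject requests that lack a browser-like User-Agent.
headers = {"User-Agent": "Mozilla/5.0"}

text = requests.get(base.format(page=4), headers=headers).text

# Strip the addMoreNewsResults( ... ) wrapper to get the JS object literal.
inner = text[text.index("(") + 1 : text.rindex(")")]

# The keys are unquoted (a JS dict, not strict JSON), so quote any key that
# follows a brace or comma before calling json.loads. A lenient parser such
# as the third-party json5 package would be more robust than this regex.
quoted = re.sub(r'([{,]\s*)([A-Za-z_]\w*)\s*:', r'\1"\2":', inner)

data = json.loads(quoted)
print(data)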

1 Answer


Here's a quick script that should show how this can be done with Selenium:

from selenium import webdriver
import time

url = "https://www.reuters.com/search/news?blob=soybean&sortBy=date&dateRange=all"
driver = webdriver.PhantomJS()
driver.get(url)
page_num = 0

# Keep clicking the "load more results" button until it disappears.
while driver.find_elements_by_css_selector('.search-result-more-txt'):
    driver.find_element_by_css_selector('.search-result-more-txt').click()
    page_num += 1
    print("getting page number " + str(page_num))
    time.sleep(1)  # give the AJAX results time to load

# Grab the fully expanded page once all results are loaded.
html = driver.page_source.encode('utf-8')
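
(As a comment below notes, newer Selenium releases have deprecated PhantomJS; a headless Chrome driver is a near drop-in replacement. A minimal sketch, assuming chromedriver is on your PATH:)

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")          # run Chrome without a visible window
driver = webdriver.Chrome(options=options)  # assumes chromedriver is on your PATH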

I don't know how to do this with requests. There seem to be lots of articles about soybeans on Reuters; I've already done over 250 "page loads" as I finish writing this answer.

Once you have scraped all of the pages (or some large number of them), you can extract the data by passing html into Beautiful Soup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
# Each search result sits in a div with class "search-result-indiv".
links = soup.find_all('div', attrs={"class": 'search-result-indiv'})
# Collect the href from each result, skipping any div without a link.
articles = [div.find('a')['href'] for div in links if div.find('a')]
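
From there you can visit the collected links; a short usage sketch, assuming the hrefs may be site-relative:

from urllib.parse import urljoin

base_url = "https://www.reuters.com"
for href in articles[:5]:
    print(urljoin(base_url, href))  # urljoin leaves absolute URLs untouched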
briancaffey
  • I am getting an error that Selenium no longer works with PhantomJS: "C:\Users\xxx\Anaconda3\lib\site-packages\selenium\webdriver\phantomjs\webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead", and it won't run. Any suggestions? – Steve Gon Mar 12 '18 at 18:52
  • @briancaffey how can I limit the maximum page number? The website detects it as a robot. – lockey Oct 27 '21 at 00:55