
I need to be able to scrape the content of many articles in a certain category from the New York Times. For example, let's say we want to look at all of the articles related to "terrorism." I would go to this link to view all of the articles: https://www.nytimes.com/topic/subject/terrorism

From here, I can click on the individual links, which directs me to a URL that I can scrape. I am using Python with the BeautifulSoup package to help me retrieve the article text.

Here is the code that I have so far, which lets me scrape all of the text from one specific article:

import requests
from bs4 import BeautifulSoup

session = requests.Session()
url = "https://www.nytimes.com/2019/10/23/world/middleeast/what-is-going-to-happen-to-us-inside-isis-prison-children-ask-their-fate.html"
req = session.get(url)
soup = BeautifulSoup(req.text, 'html.parser')
paragraphs = soup.find_all('p')

for p in paragraphs:
    print(p.get_text())

The problem is, I need to be able to scrape all of these articles under the category, and I'm not sure how to do that. Since I can scrape one article as long as I am given the URL, I would assume my next step is to find a way to gather all of the URLs under this specific category, and then run my above code on each of them. How would I do this, especially given the format of the page? What do I do if the only way to see more articles is to manually select the "SHOW MORE" button at the bottom of the list? Are these capabilities that are included in BeautifulSoup?

Jeffubert
  • 1
    I once created something similar: I had a list of Wikipedia bullets from which I got the href location, followed that link, and then got the content from that page. So yes, you should query the main page for all of its links, and then fetch the contents of each of those links. I did some looking around and I see that the 'css-ye6x8s' class is the same for all boxes of news on the terrorism page. Query for that and get the href attribute of the underlying a tag. – NLxDoDge Nov 05 '19 at 14:04
  • 1
    @NLxDoDge The problem is how to invoke the javascript that loads additional articles in the DOM. – Jonathan Scholbach Nov 05 '19 at 14:09
  • Here is another Stack Overflow question with some answers; you might want to take a look there. https://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python#26440563 – NLxDoDge Nov 05 '19 at 14:13

1 Answer


You're probably going to want to put a limit on how many articles you pull at a time. I clicked the Show More button a handful of times for the terrorism category and it just keeps going.

To find the links, you need to analyze the HTML structure and look for patterns. In this case, each article preview is in a list element with class "css-13mho3u". However, I checked another category and this class name is not consistent across categories. What is consistent is that these list elements all sit under an ordered list element with class "polite", and that holds for the other news categories as well.

Each list item contains one link that points to the article, so you simply have to grab it and extract the href. Your code can look something like this:

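# Find the ordered list that holds the article previews
ol = soup.find('ol', {'class': 'polite'})
items = ol.find_all('li')
for item in items:
    # Each list item contains one link to the article
    link = item.find('a')
    url = link['href']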
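
As a rough sketch of how the pieces could fit together, the snippet below wraps the link extraction in a function, resolves relative hrefs against the page URL, and stops after a fixed number of articles. The function name `extract_article_urls`, the `limit` parameter, and the sample HTML are made up for illustration; the `ol` with class "polite" is taken from the page structure described above and may well change.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_article_urls(html, base_url, limit=20):
    """Return up to `limit` absolute article URLs found in the topic page HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    # Assumption: article previews sit in <li> items of an <ol class="polite">
    ol = soup.find('ol', {'class': 'polite'})
    if ol is None:
        return []
    urls = []
    for item in ol.find_all('li'):
        link = item.find('a', href=True)
        if link is None:
            continue  # skip list items without a link
        # hrefs may be relative, so resolve them against the page URL
        url = urljoin(base_url, link['href'])
        if url not in urls:
            urls.append(url)
        if len(urls) >= limit:
            break
    return urls


# Made-up HTML that mimics the structure described above
sample_html = """
<ol class="polite">
  <li><a href="/2019/10/23/world/first.html">First</a></li>
  <li><a href="/2019/10/24/world/second.html">Second</a></li>
  <li>No link here</li>
</ol>
"""
article_urls = extract_article_urls(
    sample_html, "https://www.nytimes.com/topic/subject/terrorism"
)
print(article_urls)
```

With a real page you would pass the text of the topic page response as `html`, and then run the paragraph-extraction loop from the question on each returned URL.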

To click the Show More button, you'll need additional tools beyond BeautifulSoup. You can use Selenium WebDriver to click it and load the next batch of articles. You can follow the top answer at this SO question to learn how to do that.
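
A minimal sketch of the clicking step might look like the following. It assumes the button can be found by its visible text, which I have not verified against the live page; the helper name `click_show_more` and its `max_clicks` and `pause` parameters are hypothetical. The helper only needs a driver object, so the Selenium setup is shown as commented-out usage.

```python
import time


def click_show_more(driver, max_clicks=5, pause=2.0):
    """Click the "Show More" button up to `max_clicks` times; return the click count."""
    # Assumption: the button is a <button> whose text contains "show more"
    # (case-insensitive); adjust the XPath if the page markup differs.
    xpath = (
        "//button[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', "
        "'abcdefghijklmnopqrstuvwxyz'), 'show more')]"
    )
    clicks = 0
    while clicks < max_clicks:
        # "xpath" is the string value of Selenium's By.XPATH
        buttons = driver.find_elements("xpath", xpath)
        if not buttons:
            break  # no button left, so there is nothing more to load
        buttons[0].click()
        clicks += 1
        time.sleep(pause)  # crude wait for the new articles to load
    return clicks


# Hypothetical usage (needs Selenium and a browser driver installed):
# from selenium import webdriver
# from bs4 import BeautifulSoup
# driver = webdriver.Chrome()
# driver.get("https://www.nytimes.com/topic/subject/terrorism")
# click_show_more(driver, max_clicks=3)
# soup = BeautifulSoup(driver.page_source, 'html.parser')
# driver.quit()
```

The `max_clicks` cap is there because the list just keeps going; once the clicking is done, the rendered page source can be handed to BeautifulSoup as before.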

Joseph Rajchwald