I am trying to extract, from the front page of the NYT (https://www.nytimes.com), the link to each article and the complete content of each article.

To extract the links, I can use this example:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.tracing.start({
    path: 'trace.json',
    categories: ['devtools.timeline']
  })
  await page.goto('https://news.ycombinator.com/news')

  // execute standard javascript in the context of the page.
  const stories = await page.$$eval('a.storylink', anchors => { return anchors.map(anchor => anchor.textContent).slice(0, 10) })
  console.log(stories)
  await page.tracing.stop()
  await browser.close()
})()

But I don't know how to extract the content (text) of the article behind each link.

Could you please help me? Thank you!

PS: I searched the examples and tutorials I could find online and didn't find anything that helped.

  • The naive approach is to `goto` each link and scrape the text. Problem is, if you want only the article on each page, there's no obvious one-size-fits-all method to figure out what's article and what's sidebar/ads/titles/whatever other stuff might be on the page. Also, not all links on the NYT are articles. Some point to other parts of the site, ads, who knows. So this seems underspecified--more clarity is necessary. – ggorlen Jul 21 '22 at 02:09
  • I found this solution https://stackoverflow.com/questions/46293216/crawling-multiple-urls-in-a-loop-using-puppeteer – Maria Mercedes Jul 21 '22 at 10:40

2 Answers

Use `anchors.map(anchor => anchor.href)` for the hrefs,

and `anchors.map(anchor => anchor.innerText)` for the text.
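
Both can be collected in a single pass. This is only a minimal sketch: `anchorsToLinks` and `collectLinks` are hypothetical helper names, and the selector you pass in (for example the `a.storylink` from the question) is assumed to match the anchors you care about.

```javascript
// Runs inside the page. Puppeteer serializes this function and evaluates it
// in the browser, so it must not rely on variables from the Node scope.
function anchorsToLinks(anchors) {
  return anchors.map(anchor => ({
    href: anchor.href,
    text: (anchor.innerText || '').trim(),
  }));
}

// Hypothetical helper: `page` is a Puppeteer Page, `selector` is whatever
// CSS selector matches the article links on the page being scraped.
async function collectLinks(page, selector) {
  return page.$$eval(selector, anchorsToLinks);
}
```

With the page from the question, this would be called as `const links = await collectLinks(page, 'a.storylink')`, giving an array of `{ href, text }` objects.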

pguardiario

You can try something like this, replacing the placeholder class names with selectors that match the page:

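const stories = await page.evaluate(() => {
  const list = []
  // ".relevant-class" and ".relevant_class" are placeholders for the real selectors
  const news_items = document.querySelectorAll(".relevant-class")

  for (const news_item of news_items) {
    list.push({
      // query relative to the current item (the loop variable is news_item)
      heading: news_item.querySelector(".relevant_class h3").innerHTML,
      article: news_item.querySelector(".relevant_class").innerHTML,
    })
  }

  return list
})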
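
To get the text behind each link, one option is to visit the URLs one at a time and join the paragraph text on each page. This is a rough sketch, not a definitive method: `paragraphsToText` and `scrapeArticles` are hypothetical names, and the default `'article p'` selector is an assumption that will not hold for every page (not every link on a front page is an article).

```javascript
// Runs inside the page, so it must be self-contained (Puppeteer serializes it).
// Joins the non-empty paragraph texts with a blank line between them.
function paragraphsToText(paragraphs) {
  return paragraphs
    .map(p => (p.innerText || '').trim())
    .filter(text => text.length > 0)
    .join('\n\n');
}

// Hypothetical helper: `browser` is a Puppeteer Browser and `urls` is an
// array of absolute URLs. The 'article p' selector is an assumed default.
async function scrapeArticles(browser, urls, selector = 'article p') {
  const page = await browser.newPage();
  const results = [];
  for (const url of urls) {
    try {
      await page.goto(url, { waitUntil: 'domcontentloaded' });
      const text = await page.$$eval(selector, paragraphsToText);
      results.push({ url, text });
    } catch (err) {
      // Record the failure and keep going with the remaining URLs.
      results.push({ url, text: '', error: err.message });
    }
  }
  await page.close();
  return results;
}
```

Visiting pages sequentially on one tab keeps the sketch simple; pages that match nothing for the selector come back with an empty `text` rather than an error.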
Anoop D