I am trying to build a basic web scraper API. A common function, PullArticle(), uses a nested loop to scrape articles that match keywords (a different set of articles and a different set of keywords depending on the GET request I send).

I need to reset the "article" array that is passed to the callback between GET requests to keep the data separate. Instead, each GET request's results are appended to those of previous calls, which duplicates data if I call the same request twice.

I have tried using a callback function and, before that, a promise, on the advice of Stack Overflow answers. My impression was that Topic.forEach was running asynchronously, so the returned "article" came back empty. However, I haven't been able to get either approach to work, and I was hoping somebody could point out what I'm doing wrong here.

var article = []

function PullArticle (Topic, myCallback) {
    article = [] // IF I LEAVE THIS RESET OUT ARRAY RETURNS EMPTY :(

    Topic.forEach(TopicLoop => {
        newspapers.forEach(newspapers => {
            axios.get(newspapers.address) // pulling html
                .then((response) => {
                    const html = response.data
                    const $ = cheerio.load(html) // allows picking out elements
                    $(`a:contains(${TopicLoop})`, html).each(function () {
                        const title = $(this).text()
                        const url = $(this).attr('href')
                        article.push({
                            title,
                            url: newspapers.base + url,
                            source: newspapers.name,
                        })
                    })
                })
        })
    })
    let sendback = article

    myCallback(sendback)
}

In the same file I handle the GET requests with:

app.get('/TopicMatrix1', (req, res) => {
    PullArticle(Topic1, myDisplayer)
    function myDisplayer (PrintArticle) {
        res.json(PrintArticle)
    }
})
app.get('/SomeOtherTopic', (req, res) => {
    PullArticle()
    // etc
})

Also, does anyone know why I can't make myDisplayer(), which sends the result with res.json, a common function defined outside the GET handlers, so that separate GET requests can call it?

  • If you want to call `myCallback` with the `article` array only when the `axios` promise has resolved, move those two lines of code into the `then()` block (although this may not accomplish what you want). Also, read [this SO answer,](https://stackoverflow.com/questions/37576685/using-async-await-with-a-foreach-loop) it may help you with some of the concepts at play here. – Brendan Bond Jan 02 '22 at 04:06
  • Is there a reason you need `article` to be global? If you are resetting it each time, why not do `const article = []` in the first line of the function? Then just pass `article` to the callback. – about14sheep Jan 02 '22 at 04:11
  • @BrendanBond Thanks for that answer you linked it was very helpful! – Noah Kusaba Jan 02 '22 at 04:30
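
The two suggestions in the comments above (wait for the `axios` promises, and keep `article` local) can be combined. The following is only a minimal sketch under assumptions, not a tested fix for the original code: `pullArticles`, `sendArticles` and `fetchLinks` are hypothetical names that do not appear in the question. `fetchLinks(paper, topic)` stands in for the `axios.get` plus `cheerio` step and is assumed to return a promise of `{ title, url }` objects.

```javascript
// Sketch only: `fetchLinks` is a hypothetical helper (not in the original post).
// It is assumed to resolve to an array of { title, url } for one newspaper and
// one topic, e.g. by calling axios.get(paper.address) and selecting anchors
// with cheerio as in the question.
async function pullArticles(topics, newspapers, fetchLinks) {
  const pending = []; // one promise per (topic, newspaper) pair
  for (const topic of topics) {
    for (const paper of newspapers) {
      pending.push(
        fetchLinks(paper, topic).then(links =>
          links.map(({ title, url }) => ({
            title,
            url: paper.base + url,
            source: paper.name,
          }))
        )
      );
    }
  }
  // Promise.all resolves only after every request has finished, so the
  // result is complete before it is returned.
  const perRequest = await Promise.all(pending);
  // A fresh array is built on every call, so nothing carries over between
  // GET requests and no global reset is needed.
  return perRequest.flat();
}

// A shared responder can live outside the route handlers as long as `res`
// is passed in as an argument instead of being captured from the handler.
function sendArticles(res, topics, newspapers, fetchLinks) {
  return pullArticles(topics, newspapers, fetchLinks)
    .then(articles => res.json(articles))
    .catch(err => res.status(500).json({ error: err.message }));
}
```

Under these assumptions a route could be written as `app.get('/TopicMatrix1', (req, res) => sendArticles(res, Topic1, newspapers, fetchLinks))`. This also suggests why a shared `myDisplayer` outside the handler would not work as written: `res` exists only inside each handler, so a common function needs it passed in.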
