I wrote a web crawler in Node.js that sends GET requests to about 300 URLs. Here is the main loop:

for (let i = 1; i <= 300; i++) { 
    let page= `https://xxxxxxxxx/forum-103-${i}.html`
    await getPage(page,(arr)=>{
        console.log(`page ${i}`)
    })
}

Here is the function getPage(url,callback):

export default async function getPage(url, callback) {
    await https.get(url, (res) => {
        let html = ""
        res.on("data", data => {
            html += data
        })
        res.on("end", () => {
            const $ = cheerio.load(html)
            let obj = {}
            let arr = []
            obj = $("#threadlisttableid tbody")
            for (let i in obj) {
                if (obj[i].attribs?.id?.substr(0, 6) === 'normal') {
                    arr.push(`https://xxxxxxx/${obj[i].attribs.id.substr(6).split("_").join("-")}-1-1.html`)
                }
            }
            callback(arr)
            console.log("success!")
        })
    })
        .on('error', (e) => {
            console.log(`Got error: ${e.message}`);
        })
}

I use cheerio to parse the HTML and put all the information I need into a variable named 'arr'. The program runs normally for a while and then reports errors like this:

...
success!
page 121
success!
page 113
success!
page 115
success!
Got error: connect ETIMEDOUT 172.67.139.206:443
Got error: connect ETIMEDOUT 172.67.139.206:443
Got error: connect ETIMEDOUT 172.67.139.206:443
Got error: connect ETIMEDOUT 172.67.139.206:443
Got error: connect ETIMEDOUT 172.67.139.206:443
Got error: connect ETIMEDOUT 172.67.139.206:443

I have two questions:

1. What is the reason for the error? Is it because I am sending too many GET requests? How can I limit the request frequency?

2. As you can see, the order in which the pages are accessed is chaotic. How can I control it?

I have tried using other modules to send the GET requests (such as Axios), but it didn't work.

2 Answers

As you can see, the order in which the pages are accessed is chaotic. How can I control it?

await is meaningless unless you put a promise on the right-hand side. https.get does not deal in promises: it returns a ClientRequest object immediately, so the loop does not wait for the response.

You could wrap it in a promise, but it would be easier to use an API that supports promises natively, such as node-fetch, axios, or Node.js's native fetch. (These all have APIs that are, IMO, easier to use than https.get in general, not just with regard to flow control.)
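
As a rough illustration of the promise-based approach, here is a minimal sketch that fetches pages strictly one after another. It assumes Node.js 18 or later, where fetch is available globally. The names getHtml and crawlInOrder, and the optional fetchImpl parameter (which lets you swap in node-fetch or a stub), are made up for this example and do not come from the question.

```javascript
// Sketch only: assumes Node.js 18+ (global fetch).
// `fetchImpl` is a hypothetical parameter so another fetch-compatible
// function (e.g. node-fetch or a test stub) can be passed in.
async function getHtml(url, fetchImpl = fetch) {
  const res = await fetchImpl(url);
  if (!res.ok) {
    throw new Error(`HTTP ${res.status} for ${url}`);
  }
  return res.text(); // resolves with the response body as a string
}

async function crawlInOrder(urls, fetchImpl = fetch) {
  const pages = [];
  for (const url of urls) {
    // await now has a real promise to wait on, so the next request
    // starts only after the current one has finished.
    pages.push(await getHtml(url, fetchImpl));
  }
  return pages;
}
```

Because each iteration waits for the previous response, the pages come back in loop order, and only one request is in flight at a time.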

What is the reason for the error?

It isn't clear.

Is it because I am sending too many get requests?

That is a likely hypothesis.

How can I limit the request frequency?

Once you have your for loop working with promises, so that the requests are sent in series instead of in parallel, you can insert a sleep between requests.
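
Node.js has no built-in blocking sleep, so one common way to do this is to wrap setTimeout in a promise. The sketch below assumes getPage already returns a promise. The helper names sleep and crawlWithDelay and the 1000 ms default delay are arbitrary choices for illustration, not values taken from the question.

```javascript
// Resolve after `ms` milliseconds (setTimeout wrapped in a promise).
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// `getPage` is assumed to return a promise; `delayMs` is an
// illustrative default, tune it to what the target site tolerates.
async function crawlWithDelay(urls, getPage, delayMs = 1000) {
  const results = [];
  for (const url of urls) {
    results.push(await getPage(url)); // wait for this request to finish
    await sleep(delayMs);             // then pause before the next one
  }
  return results;
}
```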

Quentin
  • Thank you for your answer, I have found the cause of the error. As you suggested, I added sleep() to the loop and now the program works fine. – Irene Adle Mar 29 '22 at 12:06

The HTTP requests are fired simultaneously because the loop does not wait for the previous request to finish, due to incorrect use of await. Controlling the loop properly will limit the request frequency.


for (let i = 1; i <= 300; i++) { 
    let page= `https://xxxxxxxxx/forum-103-${i}.html`
    var arr = await getPage(page);
    // use arr in the way you want
    console.log(`page ${i}`);
}

export default async function getPage(url) {
    // Declare a new promise, wait for the promise to resolve and return its value.
    return await new Promise((reso, rej) => {
        https.get(url, (res) => {
            let html = ""
            res.on("data", data => {
                html += data
            })
            res.on("end", () => {
                const $ = cheerio.load(html)
                let obj = {}
                let arr = []
                obj = $("#threadlisttableid tbody")
                for (let i in obj) {
                    if (obj[i].attribs?.id?.substr(0, 6) === 'normal') {
                        arr.push(`https://xxxxxxx/${obj[i].attribs.id.substr(6).split("_").join("-")}-1-1.html`)
                    }
                }
                reso(arr) // Resolve with arr
                console.log("success!")
            })
        })
        .on('error', (e) => {
            console.log(`Got error: ${e.message}`);
            rej(e); // Reject so the awaiting caller sees the error; a throw in this handler would not reach it
        })
    })
}
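
If getPage rejects its promise on a network error, the calling loop can catch the failure and move on instead of stopping at the first timeout. The following is only a sketch under that assumption. The names crawlAll and failed are hypothetical, and getPage is passed in as a parameter so the example stands on its own.

```javascript
// Sketch: assumes `getPage(url)` returns a promise that rejects on failure.
async function crawlAll(urls, getPage) {
  const results = [];
  const failed = [];
  for (const url of urls) {
    try {
      // Requests still run one at a time, in loop order.
      results.push({ url, links: await getPage(url) });
    } catch (err) {
      // Record the failure and continue with the next page.
      console.log(`Got error for ${url}: ${err.message}`);
      failed.push(url);
    }
  }
  return { results, failed };
}
```

The failed list could then be retried later, once the server is accepting connections again.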
shunz19
  • Thank you for clearly specifying the cause of the error. In fact, how to get the value of `arr` has troubled me for a long time. Before, I passed in a callback to operate on `arr` as a last resort. Waiting for the promise to return its value is a more elegant choice. But I found a small typo: it should be `reso(arr)`, not `res(arr)`. Thank you again! – Irene Adle Mar 29 '22 at 12:43
  • @IreneAdle reso is not a typo, it's a resolve function of the `new Promise((reso, rej)`. Glad to help you! Welcome to the stackoverflow community – shunz19 Mar 29 '22 at 22:51
  • Sorry, maybe I didn't express myself clearly. I mean the line before `console.log("success!")` should be `reso(arr) //Resolve with arr` instead of `res(arr) //Resolve with arr`. ^-^ – Irene Adle Mar 30 '22 at 11:55
  • @IreneAdle woops, my bad! Updated my answer – shunz19 Mar 31 '22 at 12:08