So I found a website that has very cool images, and I'd like to scrape some of its data. The website hasn't been updated in about 5 years; I tried to contact its owner to ask about an API, but I didn't get any response.

Anyway, the website has categories, and each image has its own page number. So in order to scrape every image, I need to go to each category and then to each page of that particular category.

Below is my code, but I can't get the for loop to reset.

const {Cluster} = require('puppeteer-cluster');
const puppeteer = require('puppeteer');

let c = 0;
let z = 500;
(async () => {
    process.setMaxListeners(5);
    const cluster = await Cluster.launch({
        maxConcurrency: 3 // max browsers to spawn at the same time
    });
    let b = 20;
    for (let i = 0; i < b; i++) {
        cluster.execute({i}, async () => {
            let browser = await puppeteer.launch({headless: false});
            // scraping code using the i and c values
            await browser.close();
            console.log(i);
            if (i > b - 10) {
                i = 0;
                c = c + 1;
                console.log('c = ' + c);
                if (c > z)
                    process.exit();
            }
        });
    }
    await cluster.idle();
    await cluster.close();
})();

This is the output (the order doesn't matter):

1
0
2
4
3
5
6
7
8
9
10
11
c = 1
12
c = 2
13
c = 3
14
c = 4
16
c = 5
15
c = 6
17
c = 7
18
c = 8
19
c = 9

Process finished with exit code 0

If I add await in front of cluster.execute, then the for loop does reset, but then I can't use multiple browsers at the same time.
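A minimal sketch of why `await` changes the behaviour. The `runLater` helper is a hypothetical stand-in for `cluster.execute` (it is not part of puppeteer-cluster) that simply runs a callback on a later tick. With `let`, each loop iteration gets its own copy of `i`, so assigning to `i` inside a callback that runs after the loop has moved on never reaches the loop counter. When the callback is awaited, the assignment happens before the loop advances, so it takes effect.

```javascript
// Hypothetical stand-in for cluster.execute: runs the callback on a later tick.
const runLater = (fn) =>
  new Promise((resolve) => setTimeout(() => resolve(fn()), 0));

// Callbacks are not awaited: the loop finishes before any callback runs.
async function withoutAwait() {
  const seen = [];
  const pending = [];
  for (let i = 0; i < 3; i++) {
    pending.push(
      runLater(() => {
        seen.push(i);
        i = 0; // only changes this iteration's own copy of i
      })
    );
  }
  await Promise.all(pending);
  return seen;
}

// Callbacks are awaited: changing i inside the callback affects the loop.
// maxSteps is a safety cap so the sketch cannot loop forever.
async function withAwait(maxSteps) {
  const seen = [];
  let steps = 0;
  for (let i = 0; i < 3; i++) {
    await runLater(() => {
      seen.push(i);
      steps++;
      if (i === 2 && steps < maxSteps) {
        i = -1; // the loop's i++ then brings it back to 0
      }
    });
  }
  return seen;
}
```

Under these assumptions, `withoutAwait()` should visit each index exactly once, while `withAwait(5)` should walk through 0, 1, 2 a second time. The price of the second version is that only one callback runs at a time, which matches the loss of concurrency described above.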

Edit:

const {Cluster} = require('puppeteer-cluster');
const puppeteer = require('puppeteer');

(async () => {
    process.setMaxListeners(5);
    const cluster = await Cluster.launch({maxConcurrency: 3});
    let b = 15;

    let d;

    function myLoop() {
        let g = 0;
        for (g; g <= b; g++) { // was `n`, which is never defined in this snippet
            console.log(g);
            myFunc();
        }
        return g;
    }
    d = myLoop();
    console.log('d: ' + d);
    if (d > 0)
        myLoop();

    async function myFunc() {
        await cluster.execute(async () => {
            let browser = await puppeteer.launch({headless: false});
            await browser.close();
        });
    }

    await cluster.idle();
    await cluster.close();
})();
  • What is resetting a `for` loop? If you're in the middle of an iteration and you find a category that you want to iterate, can't you just do that recursively? – jfriend00 Nov 20 '19 at 15:34
  • I don't think that fits my workflow, because I have to go through different pages to find the hidden categories, which are based on the image shown on each page. I may find a category on page 6, whereas in other cases it may be on page 40. For example, say I'm in the category 'Glass' and I find an image of a 'glass of water'. The other category listed beneath the image is 'water', so I then continue by scraping that category. This is why I need to reset the for loop, because the page URL looks like this: webpage.com/cat=1/pg=1 – doingmybest Nov 20 '19 at 16:03
  • What is "reset the loop"? That is not a standard programming term. If you want to abort a loop, you would typically `break` or `return` and then the code can do something to start a new loop with the proper configuration. – jfriend00 Nov 20 '19 at 16:16
  • What I meant is that I want the `i` value to go back to `0` so that the `for` loop can start again. I'm sorry for the poor terminology and explanations; English isn't my mother tongue. – doingmybest Nov 20 '19 at 16:22
  • It looks like you have an asynchronous `cluster.execute()` inside the loop that the loop isn't waiting for, so the loop has gone far, far ahead. You probably need to promisify that and `await` it to make the loop wait for it so if, while it's executing you want to set `i` back to `0`, then the loop won't have already executed all the rest of the loop. – jfriend00 Nov 20 '19 at 16:24
  • I would add that setting the loop variable back to the start value is not a design I would prefer. I'd rather you wrap the loop in a function. When you want to stop a loop iteration and start another one, return out of the function with a return value and have the caller then call the function again, setting up a new loop with the proper input parameters. The control flow will be a lot cleaner to understand than setting a loop index back to 0 and changing some other values around it to cause the loop to now do something different. – jfriend00 Nov 20 '19 at 16:26
  • Author of puppeteer-cluster here. This is not how the library should be used. You are starting a puppeteer instance on your own, which is exactly the job the library does for you. Also, using `execute` in that way makes no sense. Please have a look at the [examples](https://github.com/thomasdondorf/puppeteer-cluster#examples) that I linked in your previous question (which you seem to have deleted) and take a look at the documentation. – Thomas Dondorf Nov 20 '19 at 18:24
  • I'm not ignoring your question, but I provided you links to the documentation and you are still using the library completely wrong. The reason you are giving ("I need to have a (pure) puppeteer instance...") does not make any sense. Read the documentation, look at the examples, provide an MCVE and I can help you. – Thomas Dondorf Nov 20 '19 at 20:07
  • I just explained why I need puppeteer: because of the proxies. And I don't know what isn't an MCVE about what I provided, because it's Minimal, it's Complete, it's Reproducible and it's an Example. What else should I provide? – doingmybest Nov 20 '19 at 21:57
  • Hi @jfriend00 ! I tried your approach, but I haven't managed to get any good results so far, and I'm pretty sure I'm missing something in your explanation. I've put the for loop inside a function with a return value. Then I had another variable outside the function to hold the return value of that function, and if it was equal to a given value, the function was called again to put the loop back on track. Let me know if that was what you meant or if I really missed something. Thanks! – doingmybest Nov 23 '19 at 06:51
  • Add the code for what you tried onto the end of your question so I can see your actual code. I can't fix code I can't see. – jfriend00 Nov 23 '19 at 06:53
  • Sorry for the late reply, but I don't get email notifications. Anyway, I've updated the question. – doingmybest Nov 23 '19 at 10:37
  • So is there anything to be done, or did I understand your explanation correctly? – doingmybest Nov 24 '19 at 01:18
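
The function-wrapping idea from the comments could look roughly like the sketch below. Everything here is an assumption made for illustration: `scrapePage` is a hypothetical stand-in for the real scraping step, and `site` is a plain object that fakes the website as a map from category name to a list of pages, where each page holds either `null` or the name of a hidden category found on it. One function scrapes a single category in small concurrent batches and returns the next category it discovers; the caller "resets the loop" simply by calling that function again with new input.

```javascript
// Hypothetical stand-in for scraping one page. Returns the name of a hidden
// category found on that page, or null if there is none.
async function scrapePage(category, page, site) {
  return site[category][page] ?? null;
}

// Scrapes one category in batches of `concurrency` pages. Stops at the first
// batch that reveals an unvisited category and returns it; otherwise null.
async function scrapeCategory(category, site, visited, concurrency = 3) {
  const pageCount = site[category].length;
  for (let start = 0; start < pageCount; start += concurrency) {
    const end = Math.min(start + concurrency, pageCount);
    const batch = [];
    for (let page = start; page < end; page++) {
      batch.push(scrapePage(category, page, site)); // started, not yet awaited
    }
    const results = await Promise.all(batch); // wait for the whole batch
    const found = results.find((c) => c !== null && !visited.has(c));
    if (found !== undefined) return found;
  }
  return null;
}

// The caller starts a fresh loop by calling scrapeCategory again.
async function crawl(startCategory, site) {
  const visited = new Set();
  const order = [];
  let category = startCategory;
  while (category !== null) {
    visited.add(category);
    order.push(category);
    category = await scrapeCategory(category, site, visited);
  }
  return order;
}
```

Note that this sketch abandons the remaining pages of a category as soon as a new category turns up, which mirrors the "jump to the next category" workflow described above. If every page has to be visited, the discovered categories would need to be collected in a list and processed afterwards instead.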

1 Answer

I think the problem might be caused by the `let`: a `let` variable only exists in the current code block. Try making a function, something like this:

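let i = 0;  // page index
let n = 0;  // number of pages (set this to the real count)
let c = 0;  // category index
let nc = 0; // number of categories (set this to the real count)

for (i = 0; i < n; i++) {
  for (c = 0; c < nc; c++) {
    postrequest(i, c);
  }
}

// Function declarations are hoisted, so it can be called from the loops above.
async function postrequest(pageindex, categoryindex) {
  // Do your async call ...
}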
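
Nested loops like these start every request at once if the calls are not awaited. A rough sketch of one way to keep a fixed number of calls in flight follows; `runWithLimit` and `buildJobs` are hypothetical helper names, not part of puppeteer or puppeteer-cluster. The (category, page) pairs are built up front with ordinary nested loops, and a small pool of "lanes" then pulls jobs from that list one by one.

```javascript
// Runs worker(job) for every job, with at most `limit` calls in flight at once.
// Results are returned in the same order as the jobs.
async function runWithLimit(jobs, limit, worker) {
  const results = new Array(jobs.length);
  let next = 0; // index of the next job that has not been started yet

  // Each lane keeps taking the next job until none are left.
  async function lane() {
    while (next < jobs.length) {
      const index = next++;
      results[index] = await worker(jobs[index]);
    }
  }

  const laneCount = Math.min(limit, jobs.length);
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}

// Builds the list of (category, page) pairs with ordinary nested loops.
function buildJobs(categoryCount, pageCount) {
  const jobs = [];
  for (let c = 0; c < categoryCount; c++) {
    for (let i = 0; i < pageCount; i++) {
      jobs.push({ category: c, page: i });
    }
  }
  return jobs;
}
```

A call such as `runWithLimit(buildJobs(nc, n), 3, (job) => postrequest(job.page, job.category))` would then process all pairs three at a time. When puppeteer-cluster is used, its `maxConcurrency` option already plays the role of the limit, so a helper like this is mainly useful outside the cluster.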

Sorry if I missed the point of the question....

Ishi