
I'm building a web scraper in Node.js that uses request and cheerio to parse the DOM. While I am using Node, I believe this is more of a general JavaScript question.

tl;dr: creating ~60,000-100,000 objects uses up all my computer's RAM, and I get an out-of-memory error in Node.

Here's how the scraper works: it's loops within loops. I've never designed anything this complex before, so there may well be better ways to do this.

Loop 1: Creates 10 objects in an array called 'sitesArr'. Each object represents one website to scrape.

var sitesArr = [
    {
        name: 'store name',
        baseURL: 'www.basedomain.com',
        categoryFunct: '(function(){ // do stuff })();',
        gender: 'mens', 
        currency: 'USD',
        title_selector: 'h1',
        description_selector: 'p.description'
    },
    // ... x10
]

Loop 2: Loops through 'sitesArr'. For each site it fetches the homepage via 'request' and gets a list of category links, usually 30-70 URLs. It appends these URLs to the 'sitesArr' object they belong to, in an array property called 'categories'.

var sitesArr = [
    {
        name: 'store name',
        baseURL: 'www.basedomain.com',
        categoryFunct: '(function(){ // do stuff })();',
        gender: 'mens', 
        currency: 'USD',
        title_selector: 'h1',
        description_selector: 'p.description',
        categories: [
                        {
                            name: 'shoes',
                            url: 'www.basedomain.com/shoes'
                        },{
                            name: 'socks',
                            url: 'www.basedomain.com/socks'
                        } // x 50
                    ]
    },
    // ... x10
]
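
In code, that step looks roughly like this (a simplified sketch, not my exact code; the 'a.category-link' selector is a placeholder, since the real selector differs per site):

var request = require('request');
var cheerio = require('cheerio');

// Simplified sketch of loop 2: fetch each site's homepage and collect category links.
sitesArr.forEach(function (site) {
    request('http://' + site.baseURL, function (err, res, body) {
        if (err) return console.error(err);
        var $ = cheerio.load(body);
        site.categories = [];
        // 'a.category-link' is a placeholder; each site uses its own selector.
        $('a.category-link').each(function () {
            site.categories.push({
                name: $(this).text().trim(),
                url: $(this).attr('href')
            });
        });
    });
});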

Loop 3: Loops through each 'category'. For each URL it gets a list of product links and puts them in an array, usually ~300-1000 products per category.

var sitesArr = [
    {
        name: 'store name',
        baseURL: 'www.basedomain.com',
        categoryFunct: '(function(){ // do stuff })();',
        gender: 'mens', 
        currency: 'USD',
        title_selector: 'h1',
        description_selector: 'p.description',
        categories: [
                        {
                            name: 'shoes',
                            url: 'www.basedomain.com/shoes',
                            products: [
                                'www.basedomain.com/shoes/product1.html',
                                'www.basedomain.com/shoes/product2.html',
                                'www.basedomain.com/shoes/product3.html',
                                // x 300
                            ]
                        },// x 50
                    ]
    },
    // ... x10
]
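
Loop 3 follows the same pattern one level down (again a simplified sketch; 'a.product-link' is a placeholder selector):

// Simplified sketch of loop 3: for each category page, collect the product URLs.
site.categories.forEach(function (category) {
    request(category.url, function (err, res, body) {
        if (err) return console.error(err);
        var $ = cheerio.load(body);
        category.products = [];
        $('a.product-link').each(function () {
            category.products.push($(this).attr('href'));
        });
    });
});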

Loop 4: Loops through each 'products' array, goes to each URL, and creates a product object for each.

var product = {
    infoLink: "www.basedomain.com/shoes/product1.html",
    description: "This is a description for the object",
    title: "Product 1",
    Category: "Shoes",
    imgs: ['http://foo.com/img.jpg','http://foo.com/img2.jpg','http://foo.com/img3.jpg'],
    price: 60,
    currency: 'USD'
}
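
Roughly, that object is built like this (simplified sketch; 'productURL' is the current entry from the products array, 'site' and 'category' come from the outer loops, and the image and price selectors are placeholders):

// Simplified sketch of loop 4: scrape one product page and build the product object.
request(productURL, function (err, res, body) {
    if (err) return console.error(err);
    var $ = cheerio.load(body);

    var imgs = [];
    $('img.product-image').each(function () {        // placeholder selector
        imgs.push($(this).attr('src'));
    });

    var product = {
        infoLink: productURL,
        description: $(site.description_selector).first().text().trim(),
        title: $(site.title_selector).first().text().trim(),
        Category: category.name,
        imgs: imgs,
        price: parseFloat($('.price').first().text().replace(/[^0-9.]/g, '')),  // placeholder selector
        currency: site.currency
    };

    saveProduct(product); // hypothetical name for the MongoDB upsert function, sketched below
});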

Then I ship each product object off to a MongoDB function that upserts it into my database.
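
That function is essentially an upsert keyed on the product URL. A minimal sketch with the native mongodb driver (the connection string, database, and collection name are placeholders, and in practice the connection would be opened once and reused rather than per product):

var MongoClient = require('mongodb').MongoClient;

// Minimal sketch of the upsert; 'saveProduct' is a hypothetical name.
function saveProduct(product) {
    MongoClient.connect('mongodb://localhost:27017/scraper', function (err, db) {
        if (err) return console.error(err);
        db.collection('products').update(
            { infoLink: product.infoLink },   // match on the product URL
            { $set: product },                // insert or update the full document
            { upsert: true },
            function (err) {
                if (err) console.error(err);
                db.close();
            }
        );
    });
}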

THE ISSUE

This all worked just fine until the process got large. I'm creating about 60,000 product objects every time this script runs, and after a little while all of my computer's RAM is used up. What's more, after getting about halfway through the process I get the following error in Node:

 FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory

I'm very much of the mind that this is a code design issue. Should I be "deleting" the objects once I'm done with them? What's the best way to tackle this?

  • It seems there is "a better way to do it" than loops within loops; for example, you could use queues and spawn processes to handle other work. Since the scraper literally eats your RAM, instead of buying even more RAM, try the "there is a better way" approach – Gntem Sep 07 '13 at 13:40
  • @GeoPhoenix Can you recommend any resources for learning this? I'm new to coding beyond basic websites, so I'm not sure where to start. – JVG Sep 08 '13 at 00:16
  • I think large objects should be set to null after use to free memory, as suggested in this answer: http://stackoverflow.com/a/5733714/988078 – Vivek Bajpai Sep 30 '13 at 14:41
