I'm building a web scraper in Node.js that uses request and cheerio to parse the DOM. While I am using Node, I believe this is more of a general JavaScript question.
tl;dr - creating ~60,000-100,000 objects uses up all my computer's RAM and I get an out-of-memory error in Node.
Here's how the scraper works. It's loops within loops; I've never designed anything this complex before, so there might be much better ways to do this.
Loop 1: Creates 10 objects in an array called 'sitesArr'. Each object represents one website to scrape.
var sitesArr = [
{
name: 'store name',
baseURL: 'www.basedomain.com',
categoryFunct: '(function(){ // do stuff })();',
gender: 'mens',
currency: 'USD',
title_selector: 'h1',
description_selector: 'p.description'
},
// ... x10
]
Loop 2: Loops through 'sitesArr'. For each site it goes to the homepage via 'request' and gets a list of category links, usually 30-70 URLs. It appends these URLs to the 'sitesArr' object they belong to, in an array property named 'categories'.
var sitesArr = [
{
name: 'store name',
baseURL: 'www.basedomain.com',
categoryFunct: '(function(){ // do stuff })();',
gender: 'mens',
currency: 'USD',
title_selector: 'h1',
description_selector: 'p.description',
categories: [
{
name: 'shoes',
url: 'www.basedomain.com/shoes'
},{
name: 'socks',
url: 'www.basedomain.com/socks'
} // x 50
]
},
// ... x10
]
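In simplified form, the category-gathering step looks roughly like this (error handling is trimmed and 'a.category-link' is only a placeholder selector, since the real per-site logic isn't shown here):

var request = require('request');
var cheerio = require('cheerio');

// Loop 2 (sketch): fetch each site's homepage and collect category links.
sitesArr.forEach(function (site) {
  request('http://' + site.baseURL, function (err, res, body) {
    if (err) return console.error(err);
    var $ = cheerio.load(body);
    site.categories = [];
    $('a.category-link').each(function () { // placeholder selector
      site.categories.push({
        name: $(this).text().trim(),
        url: site.baseURL + $(this).attr('href')
      });
    });
  });
});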
Loop 3: Loops through each 'category'. For each URL it gets a list of product links and puts them in an array, usually ~300-1000 products per category.
var sitesArr = [
{
name: 'store name',
baseURL: 'www.basedomain.com',
categoryFunct: '(function(){ // do stuff })();',
gender: 'mens',
currency: 'USD',
title_selector: 'h1',
description_selector: 'p.description',
categories: [
{
name: 'shoes',
url: 'www.basedomain.com/shoes',
products: [
'www.basedomain.com/shoes/product1.html',
'www.basedomain.com/shoes/product2.html',
'www.basedomain.com/shoes/product3.html',
// x 300
]
},// x 50
]
},
// ... x10
]
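This step follows the same request/cheerio pattern, one level deeper; roughly ('a.product-link' is again just a placeholder selector):

// Loop 3 (sketch): runs inside the site loop above, so 'site' is in scope.
site.categories.forEach(function (category) {
  request('http://' + category.url, function (err, res, body) {
    if (err) return console.error(err);
    var $ = cheerio.load(body);
    category.products = [];
    $('a.product-link').each(function () { // placeholder selector
      category.products.push(site.baseURL + $(this).attr('href'));
    });
  });
});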
Loop 4: Loops through each 'products' array, goes to each URL, and creates an object for each product.
var product = {
infoLink: "www.basedomain.com/shoes/product1.html",
description: "This is a description for the object",
title: "Product 1",
Category: "Shoes",
imgs: ['http://foo.com/img.jpg','http://foo.com/img2.jpg','http://foo.com/img3.jpg'],
price: 60,
currency: 'USD'
}
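In simplified form, that per-product step looks something like this (it runs inside the site/category loops above, the selectors come from the site config, image and price extraction are omitted, and 'saveProduct' is a placeholder name for the MongoDB step described next):

// Loop 4 (sketch): 'site' and 'category' come from the enclosing loops.
category.products.forEach(function (productUrl) {
  request('http://' + productUrl, function (err, res, body) {
    if (err) return console.error(err);
    var $ = cheerio.load(body);
    var product = {
      infoLink: productUrl,
      title: $(site.title_selector).first().text().trim(),
      description: $(site.description_selector).first().text().trim(),
      Category: category.name,
      currency: site.currency
      // imgs and price extraction omitted in this sketch
    };
    saveProduct(product); // hands off to the MongoDB upsert below
  });
});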
Then each product object is shipped off to a MongoDB function which does an upsert into my database.
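That function is essentially a single upsert call, roughly like this (assuming the native mongodb driver with an open connection in 'db', and using 'infoLink' as the unique key):

// Sketch of the upsert step; 'saveProduct' is the placeholder name used above.
function saveProduct(product) {
  db.collection('products').update(
    { infoLink: product.infoLink }, // match existing record by product URL
    { $set: product },
    { upsert: true },
    function (err) {
      if (err) console.error(err);
    }
  );
}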
THE ISSUE
This all worked just fine until the process got large. I'm creating about 60,000 product objects every time this script runs, and after a little while all of my computer's RAM is being used up. What's more, after getting about halfway through my process I get the following error in Node:
FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory
I'm very much of the mind that this is a code design issue. Should I be "deleting" the objects once I'm done with them? What's the best way to tackle this?