I'm trying to parse web pages recursively with PhantomJS.

For example:

WebPage:
 link1,
 link2,
 link3,
 link4,
 link5,
 nextPage

This is what I do with each page:

var parsePage = function(links) {

    // parse every link on the page
    for (var i = 0; i < links.length; i++) {
        parsePost(links[i]);
    }
};

parsePost extracts some information from each linked page, such as all emails and phone numbers matched by regex, which takes a lot of time.

But PhantomJS (JavaScript) is asynchronous: it does not wait until every link has been parsed before going on to nextPage. What actually happens is this:

- parsing page1
  - parsing link1
  - parsing link2
   ....
  - parsing link5
- parsing page2
  - parsing link1
   ....
  - parsing link5

  -> and only now do the results from page1 -> link1 arrive in the console
  .....
- parsing page3

As a result, it eats my PC's 6 GB of memory within 3 minutes.

How can I solve this problem?

What I have considered or tried:

 1. Maybe limit the program's memory use? (Would it then wait until some processes have finished before continuing to parse other pages?)
 2. I tried something like this:

> page.open(link, function(... here is pageparser (which parses every link))
and then page.close()

But pageparser takes a long time, so when I call page.close() it stops the pageparser process.
vromanch

1 Answer

I think you should design your JavaScript for PhantomJS as suggested in this other post on Stack Overflow. I did it that way and it worked just fine.
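
A minimal sketch of one such design, assuming the goal is to have only one page open at a time: each link is handled only after the previous one has reported that it is finished. The helper names `processSequentially` and `fakeParsePost` are hypothetical and are not taken from the linked post; `fakeParsePost` only simulates a slow parse with a timer so that the snippet runs on its own.

```javascript
// Process items one at a time: the next item starts only after the
// worker for the current item has called its `next` callback.
function processSequentially(items, worker, onDone) {
    var index = 0;
    var results = [];

    function step() {
        if (index >= items.length) {
            onDone(results);          // every item has been handled
            return;
        }
        var item = items[index++];
        worker(item, function (result) {
            results.push(result);
            step();                   // only now move on to the next item
        });
    }

    step();
}

// Hypothetical stand-in for opening and parsing one post.
// In PhantomJS the worker would create a page, call page.open(url, callback),
// do the parsing inside that callback, call page.close(), and then call next().
function fakeParsePost(link, next) {
    setTimeout(function () {
        next('parsed ' + link);
    }, 10);
}

processSequentially(['link1', 'link2', 'link3'], fakeParsePost, function (results) {
    console.log(results.join(', '));
});
```

With this shape, page.close() is called by the worker itself once its parsing is done, so closing the page no longer cuts the parser off, and nextPage can be requested from the final callback instead of from the loop.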

scrat.squirrel