I'm writing a crawler module that calls itself recursively to download more and more links, depending on a depth option passed to it.

Besides that, I run further tasks on the resources I've downloaded (enriching or changing them depending on the configuration passed to the crawler).
This process continues recursively until it's done, which may or may not take a lot of time depending on the configuration used.

I want to make it as fast as possible and keep it from hindering any Node.js application that uses it.
I've set up an Express server in which one route launches the crawler for a user-defined host (passed in the query string).
After launching a few crawling sessions for different hosts, I've noticed that other routes, which only return simple text, sometimes respond very slowly.
The delay can be anywhere from a few milliseconds to something like 30 seconds, and it seems to happen at random times (well, nothing is random, but I can't pinpoint the cause).
I've read a JetBrains article about CPU profiling with the V8 profiler integration in WebStorm, but unfortunately it only shows how to collect the information and how to view it. It doesn't give me any hints on how to find such problems, so I'm pretty much stuck here.

Could anyone guide me on this? I'd appreciate any tips on what my crawler might be doing (a lot of recursive calls) that could hinder the Express server, or on how to find the hotspots I'm looking for and optimize them.

Jorayen

1 Answer

It's hard to say anything more specific on how to optimize code that is not shown, but I can give some advice that is relevant to the described situation.

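One thing that comes to mind is that you may be running some blocking code. Never use deep recursion without using setTimeout or setImmediate to break it up and give the event loop a chance to run once in a while. (process.nextTick is not enough for this: its callbacks run before the event loop moves on to pending I/O, so a chain of nextTick calls can still starve other requests.)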
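As a rough illustration of breaking work up, here is a minimal sketch that processes a list in slices and yields to the event loop between slices with `setImmediate`. The names `processInChunks`, `handleItem`, and `chunkSize` are made up for this example and have nothing to do with the crawler in the question; the right chunk size depends on how expensive each item is.

```javascript
'use strict';

// Hypothetical helper: run `handleItem` over `items` in slices of `chunkSize`,
// yielding to the event loop between slices so that pending I/O
// (for example other Express requests) can be served in between.
function processInChunks(items, handleItem, chunkSize = 100) {
  return new Promise((resolve, reject) => {
    const results = [];
    let index = 0;

    function runChunk() {
      try {
        const end = Math.min(index + chunkSize, items.length);
        for (; index < end; index++) {
          results.push(handleItem(items[index]));
        }
      } catch (err) {
        reject(err);
        return;
      }

      if (index < items.length) {
        // Schedule the next slice after pending I/O callbacks have run.
        setImmediate(runChunk);
      } else {
        resolve(results);
      }
    }

    runChunk();
  });
}
```

It could be called as `processInChunks(links, enrichLink, 50).then(done)`, where `links`, `enrichLink`, and `done` are placeholders. Each slice still blocks while it runs, so this only helps if a single slice is short.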

rsp
  • I figured someone would say that, but then this question would be really specific to my code and could not help other people who might stumble upon it. If you have any idea where I could ask such a question and be specific enough to provide my code, that would be great, but it's not practical on Stack Overflow because it's not just a one- or two-file module, and I don't want to tell people "here's the code, look at it and tell me what's wrong with it". – Jorayen Sep 23 '16 at 14:47
  • Also, I figured I should use timers to give the event loop a chance to run, but what I can't figure out is which pieces of code to wrap inside timers. – Jorayen Sep 23 '16 at 14:48
  • @Jorayen this [existing stackoverflow question](http://stackoverflow.com/questions/25568613/node-js-event-loop) might answer a thing or two. – Gimby Sep 23 '16 at 14:49
  • @Gimby Thanks for sharing. I already know all of this: I know I should avoid blocking the event loop by using timers. What I don't know is what exactly to wrap in them in my crawler to get the best performance. Generally I'd target CPU-bound tasks, but my crawler does lots of calculations in nested loops anyway, so does that mean I should wrap everything? I can't tell. I also use the async version of every I/O function I call, and I use async as a control-flow library. – Jorayen Sep 23 '16 at 14:58
  • @Jorayen "I can't figure out is what pieces of code to target" - not having seen even a single piece of code that you're talking about, all I can say is: target all of them. :) – rsp Sep 23 '16 at 15:07
  • @rsp I'd like you to be able to see the code, but is this the right place to put a link to the repository? If not, where could we move this discussion? – Jorayen Sep 23 '16 at 15:19
  • @rsp Okay, I think I found the problem with my code (through logging, though, not profiling). I have one loop that operates on N items and does heavy string operations on each item. When this loop iterates over a large set, the program halts. Even if I try to defer it using timers, it still blocks the application while it executes, because it takes some time. What are my options in such a case? Is launching a new process the only solution? – Jorayen Sep 23 '16 at 23:54
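
For the situation in the last comment, one option besides a separate process is to move the heavy loop off the main thread. The following is only a sketch under assumptions: it uses Node's `worker_threads` module, which needs a Node.js version that ships it (it did not exist when this thread was written), and `heavyStringWork` is a hypothetical stand-in for the real per-item string processing. `child_process.fork` would be the equivalent approach with a separate process.

```javascript
'use strict';
const { Worker } = require('worker_threads');

// Worker source kept inline so the sketch fits in one file.
// `heavyStringWork` is a made-up placeholder for the real per-item work.
const workerSource = `
  const { parentPort, workerData } = require('worker_threads');
  function heavyStringWork(item) {
    return item.split('').reverse().join('').toUpperCase();
  }
  parentPort.postMessage(workerData.map(heavyStringWork));
`;

// Runs the whole loop in a worker thread and resolves with the results,
// so the main thread's event loop stays free to serve requests.
function processInWorker(items) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, { eval: true, workerData: items });
    worker.once('message', resolve);
    worker.once('error', reject);
    worker.once('exit', (code) => {
      if (code !== 0) {
        reject(new Error(`Worker exited with code ${code}`));
      }
    });
  });
}
```

The items are copied to the worker and the results are copied back, so this pays off when the per-item work is expensive compared with the cost of transferring the data.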