
I tried using NodeJS in a server-side script to parse the text content in local PDF files using pdf-parse, which in turn uses Mozilla's amazing PDF parser. Everything worked wonderfully in my dev sandbox, but the whole thing came crashing down on me when I attempted to use the same code in production.

My problem was caused by the sheer number of PDF files I'm trying to process asynchronously: I have more than 100K files that need processing, and Mozilla's PDF parser is (understandably) unconditionally asynchronous – the OS killed my node process because of too many open files. I had started by writing all of my code asynchronously (the preliminary part where I search for PDF files to parse), but even after refactoring all the code for synchronous operation, it still kept crashing.

The gist of the problem is related to the cost of the operations: walking the folder structure to look for PDF files is cheap, whereas actually opening the files, reading their contents and parsing them is expensive. So Node kept generating new promises for each file it encountered, and the promises were never fulfilled. If I tried to run the code manually on smaller folders, it worked like a charm – really fast and reliable. As soon as I tried to execute the code on the entire folder structure it crashed, no matter what.

I know Node enthusiasts always answer questions like these by saying the OP is using the wrong programming pattern, but I'm stumped as to what would be the correct pattern in this case.

Bogdan Stăncescu

1 Answer


You need to control how many simultaneous asynchronous operations you start at once; that number is under your control. You don't show your code, so we can only advise conceptually.

For example, if you look at this answer:

Promise.all consumes all my RAM

It shows a function called mapConcurrent() that iterates an array, calling an asynchronous (promise-returning) function for each item while keeping no more than a maximum number of async operations "in flight" at any given time. You can tune that concurrency limit to suit your situation.

Another implementation here:

Make several requests to an API that can only handle 20 request a minute

with a function called pMap() that does something similar.

There are other such implementations built into libraries such as Bluebird and Async-promises.

jfriend00
  • I'd also recommend `async-q` for promise-based code or the venerable `caolan/async` for callback-based code – slebetman Nov 16 '19 at 22:52
  • @BogdanStăncescu - Does this answer your question? – jfriend00 Nov 17 '19 at 01:56
  • I think so, yes; it's just that I didn't get a chance to test it yet. It's something I'm genuinely interested in, and I will certainly test it – and I won't forget to accept your answer. :) – Bogdan Stăncescu Nov 19 '19 at 19:01
  • I've been able to use @slebetman's suggestion, and even extend it further by using npm's `async` – and it worked. I'm sure I could use the code in your examples as well. I just don't understand the mechanics of it all right now. – Bogdan Stăncescu Nov 19 '19 at 22:45
  • @BogdanStăncescu - Well, `mapConcurrent()` in my first link is just like an asynchronous `.map()`. You pass it an array, a max number of concurrent asynchronous operations you want in flight at a time (this is the control you need) and a function that will get called for each item in the array and returns a promise that resolves to whatever eventual value you want for that item in the array. The function itself returns a promise that resolves to an array of values (just like `.map()`). So, it's just an asynchronous `.map()`. – jfriend00 Nov 19 '19 at 23:27
  • @BogdanStăncescu - I'm curious what part of using `mapConcurrent()` did you not understand? Or, did you just want a pre-packaged solution? – jfriend00 Nov 20 '19 at 00:12
  • IMHO `async-q` is more feature complete than `async-promises` even though it is 4 years older – slebetman Nov 20 '19 at 00:32
  • @slebetman, `async-promises` is a library I hadn't been aware of; I meant `async`. While `async-q` might be more feature-rich, that is not the first concern when deciding upon a library, unless one actually needs the extra features. One mainly looks for maintainability, and `async-q` certainly lacks in that regard, at least when compared to `async`. But enough said; I thank you for suggesting that train of investigation, because it helped me develop a POC based on `async` – but I won't be using either, anyway. – Bogdan Stăncescu Nov 20 '19 at 00:55
  • @jfriend00, yes, I figured it out, after all. And it totally works, for all intents and purposes. It's just that it doesn't satisfy my purist approach (which I apologise for in advance). I'm an old timer who expects a computer program to be coercible into a Turing machine which can take an infinite amount of input, take an infinite amount of time to process, and produce some output, _using a finite amount of memory_. Any classical procedural or OOP language can be turned into a well-behaved Turing machine, whereas NodeJS (and JS, by extension) apparently can't. And that irks me to no end... :( – Bogdan Stăncescu Nov 20 '19 at 01:00
  • @BogdanStăncescu - What is non-Turing about `mapConcurrent()`? If you want a flow of results (in order) as they are available (so no large set of result data is accumulated), that would be a different design with a different interface and is perfectly doable (though wasn't really specified in your question). I'd want to know more about the actual requirements of the project or desires for the design rather than just "more Turing like". Likewise if you want it to be a machine you can continually feed new data to, I'd need to know about that too. – jfriend00 Nov 20 '19 at 01:07
  • @jfriend00, how should we go about a more fluid exchange? I think this medium is limiting. – Bogdan Stăncescu Nov 20 '19 at 01:09
  • @BogdanStăncescu - Well, you could write a new question, describe what you're looking to do (specific requirements), describe what you've used so far and what's its limitations are and solicit new ideas that way. There's also stackoverflow chat for interactive communication, though I rarely do that because it requires both ends sitting at their computer and participating in a timely fashion. – jfriend00 Nov 20 '19 at 01:20
  • @jfriend00, my concern is that one can't contain the number of events piling up at the incoming end of the process, even if the code is refactored to allow for dynamic allocation (as opposed to the buffered approach you illustrated in the response above). – Bogdan Stăncescu Nov 20 '19 at 01:25
  • @BogdanStăncescu - If you have more input coming than you can process (so it's at least temporarily piling up), what do you want a solution to do other than queue it up until you can get to it? This solution is about making sure that you stay within the bounds of your processing resources so you don't blow things up. That is the general solution. Anything more than that seems application specific. Maybe what you're asking about is a more generalized queuing system rather than how to efficiently process a batch. This is a batch solution. – jfriend00 Nov 20 '19 at 01:29
  • @jfriend00, but that's specifically the point: my data is coming from parsing the folders on the very server which is running this code (see the OP). It's not like I have to throttle some incoming deluge of data – I'm unable to control the data my own code is generating. I find that woefully frustrating. – Bogdan Stăncescu Nov 20 '19 at 01:33
  • @BogdanStăncescu - OK, you could pursue a more holistic solution that throttles the input (presumably iterating in the file system) to an acceptable rate rather than queue all the input and then throttle the processing. That probably assumes the file system isn't in flux during processing or would require additional passes to handle things that might have changed while processing. FYI, you don't need to include my name at the beginning of a comment on my own answer. I get automatically notified of all those regardless. – jfriend00 Nov 20 '19 at 01:56
  • @jfriend00, throttling the input is specifically what I asked for in the OP. – Bogdan Stăncescu Nov 20 '19 at 01:59
  • @BogdanStăncescu - Can't offer you anything there without seeing the code for the input. There's no theoretical answer to that; it's entirely dependent upon where the input comes from. I think I'm done here with this theoretical part of the discussion when there's no code you're sharing. Theoretical solutions are way, way, way more complicated than seeing a real coding problem and offering a real coding solution. – jfriend00 Nov 20 '19 at 03:04
  • The concrete problem is actually trivial – I never avoided posting it in order to protect my code, but because it's positively textbook boring. Should I post a POC in the OP, or elsewhere? – Bogdan Stăncescu Nov 20 '19 at 03:09
  • @BogdanStăncescu - You've completely lost me what you want help with. I've spent a lot of energy trying to help you, but am lost as to how to help further. I think I'm going to work on other questions now. You could post a new question if you have something new you want to cover that you could clearly convey in a new question. FYI, I don't know what POC is. – jfriend00 Nov 20 '19 at 03:14
  • @BogdanStăncescu - I read your question one more time. node.js does not offer any way to throttle the event system itself, so you cannot throttle at that level. If you want to throttle some input, then you have to work on how to throttle that specific type of input, and some activities, such as listing the files in a directory, only come all or nothing. If you were listing thousands of files in many directories, then you could design a system that provides input only as requested by the consumer, but it would still have to buffer chunks internally (a sketch of this idea follows the comment thread). – jfriend00 Nov 20 '19 at 04:09
  • @BogdanStăncescu - cont'd. That's because the underlying file operations only supply lists of files in chunks so they have to be buffered somewhere. – jfriend00 Nov 20 '19 at 04:10
  • @BogdanStăncescu - So, there is no generic answer to your question about throttling events in node.js. It's not a feature that node.js offers and the dispatching of events is internal to the implementation of node.js, not something we get direct access to as a node.js Javascript developer. If you have some series of asynchronous file operations in a loop or recursive or something like that, then that could be throttled somewhat, but how to do that would be entirely specific to your implementation, not a general solution. With no code to go on, not much to do there. – jfriend00 Nov 20 '19 at 04:14
  • 1
    @BogdanStăncescu - So, I don't know what else you're waiting for in an answer... – jfriend00 Nov 20 '19 at 04:16
  • @BogdanStăncescu `async` is callback based and was the original library to implement all the asynchronous control flow design patterns. Both `async-q` and `async-promises` are reimplementations of `async` but promise based. I was merely saying that `async-q` has all or almost all the functions of `async` while `async-promises` has only some. – slebetman Nov 20 '19 at 05:54