I tried using NodeJS in a server-side script to parse the text content in local PDF files using pdf-parse, which in turn uses Mozilla's amazing PDF parser. Everything worked wonderfully in my dev sandbox, but the whole thing came crashing down on me when I attempted to use the same code in production.
My problem was caused by the sheer number of PDF files I'm trying to process asynchronously: I have more than 100K files that need processing, and Mozilla's PDF parser is (understandably) unconditionally asynchronous – the OS killed my node process because of too many open files. I had started by writing all of my code asynchronously (the preliminary part where I search for PDF files to parse), but even after refactoring all the code for synchronous operation, it still kept crashing.
The gist of the problem is related to the cost of the operations: walking the folder structure to look for PDF files is cheap, whereas actually opening the files, reading their contents and parsing them is expensive. So Node kept generating new promises for each file it encountered, and the promises were never fulfilled. If I tried to run the code manually on smaller folders, it worked like a charm – really fast and reliable. As soon as I tried to execute the code on the entire folder structure it crashed, no matter what.
I know Node enthusiasts always answer questions like these by saying the OP is using the wrong programming pattern, but I'm stumped as to what would be the correct pattern in this case.