
I have thousands of searchable PDFs, some of which are up to 1 GB in size with over 2,000 pages. I need to be able to search for a text string in these files from a Node.js app.

Right now, files are stored in a Google Cloud Storage bucket.

What's the best way to do this?

Some options:

  • Extract the text from the PDF files into MySQL using something like the npm package pdf-text-extract, then use MySQL queries to search for text strings.
  • Search the PDF files directly using some npm package.

Am I completely off? Is there a better way?

markkazanski
  • Not available in Node, but the [GAE Search API](https://cloud.google.com/appengine/docs/standard/python/search/) looks like exactly what you need. As long as you extract the text from the PDF before inserting the document (e.g. with a tool like [this one](https://pdfbox.apache.org/2.0/commandline.html#extracttext)), it should work for you. If you're comfortable with other languages, you could build a GAE service that handles only the search part and call it from your Node app. I can expand the explanation if you're interested in this route. – Jofre Aug 15 '18 at 16:18
  • The documentation says that the maximum document size is 1MB. I have files up to 1GB. And, I'm really only comfortable with Node.js. – markkazanski Aug 15 '18 at 20:38

1 Answer


There are dedicated text search libraries out there, like this one, or this. Most likely you'd need to extract the plain text from each PDF, save it, and index it. Then you'll be able to run search queries against the index. Setting up a database for this particular task may be overkill.
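The extract, index, and search steps described above can be sketched in plain Node.js without any search library. This is only an illustration under assumptions: the text is presumed to be already extracted page by page, and the names `tokenize`, `PageIndex`, `addPage`, and `search` are hypothetical, not part of any package mentioned here. It matches pages that contain every word of the query, which is weaker than an exact string match.

```javascript
// Minimal in-memory inverted index: token -> Set of "file#page" keys.
// All names (tokenize, PageIndex, addPage, search) are hypothetical.

function tokenize(text) {
  // Lowercase, then split on anything that is not a letter or a digit.
  return text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
}

class PageIndex {
  constructor() {
    this.postings = new Map(); // token -> Set of page keys
  }

  // Index the already-extracted text of one page of one PDF.
  addPage(file, page, text) {
    const key = `${file}#${page}`;
    for (const token of new Set(tokenize(text))) {
      if (!this.postings.has(token)) this.postings.set(token, new Set());
      this.postings.get(token).add(key);
    }
  }

  // Return the keys of pages that contain every token of the query.
  search(query) {
    const tokens = tokenize(query);
    if (tokens.length === 0) return [];
    let result = null;
    for (const token of tokens) {
      const pages = this.postings.get(token) || new Set();
      result = result === null
        ? new Set(pages)
        : new Set([...result].filter((key) => pages.has(key)));
      if (result.size === 0) break; // no page can match any more
    }
    return [...result].sort();
  }
}

// Made-up sample pages standing in for extracted PDF text.
const index = new PageIndex();
index.addPage('report.pdf', 1, 'Quarterly revenue grew in Europe.');
index.addPage('report.pdf', 2, 'Revenue in Asia was flat.');
index.addPage('manual.pdf', 1, 'Installation guide for the pump.');

console.log(index.search('revenue'));      // [ 'report.pdf#1', 'report.pdf#2' ]
console.log(index.search('revenue asia')); // [ 'report.pdf#2' ]
```

An index like this lives entirely in the memory of the Node process, so its size grows with the number of distinct words and pages. For a large corpus it would probably need to be persisted or replaced by a disk-backed index.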

  • If I use the Elasticlunr npm package and build an index, is the index held in memory on the server? Is that workable with 80 GB of PDFs? The actual amount of extracted text may be less, but it would still be hundreds of MB. With MySQL, I was thinking of using `FULLTEXT KEY` to get InnoDB's full-text indexing and search. Is that reasonable? Also, this would be deployed on Google App Engine or Compute Engine. – markkazanski Aug 14 '18 at 20:21
  • @markkazanski You can deploy it in either App Engine or Compute Engine. Do you have anything prepared already in any of those environments? – Mangu Aug 20 '18 at 09:48
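
On the `FULLTEXT KEY` idea raised in the comments above, a rough sketch of what the MySQL side might look like follows. Everything here is an assumption for illustration: the table name `pdf_pages`, its columns, and the helper `buildSearch` are hypothetical, and storing one row per page is just one possible layout. The block only builds the SQL and its parameters; running them would need a MySQL client of your choice.

```javascript
// Hypothetical schema: one row per PDF page, with an InnoDB full-text index.
const CREATE_TABLE_SQL = `
CREATE TABLE pdf_pages (
  id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  file VARCHAR(255) NOT NULL,
  page INT UNSIGNED NOT NULL,
  body MEDIUMTEXT NOT NULL,
  FULLTEXT KEY ft_body (body)
) ENGINE=InnoDB`;

// Build a parameterized full-text query (hypothetical helper).
// With { phrase: true } the term is wrapped in double quotes and searched
// in boolean mode, which MySQL treats as a literal phrase match.
function buildSearch(term, { phrase = false } = {}) {
  if (phrase) {
    const cleaned = term.replace(/"/g, ''); // quotes would end the phrase early
    return {
      sql: 'SELECT file, page FROM pdf_pages ' +
           'WHERE MATCH(body) AGAINST (? IN BOOLEAN MODE)',
      params: [`"${cleaned}"`],
    };
  }
  return {
    sql: 'SELECT file, page FROM pdf_pages ' +
         'WHERE MATCH(body) AGAINST (? IN NATURAL LANGUAGE MODE)',
    params: [term],
  };
}

console.log(CREATE_TABLE_SQL);
console.log(buildSearch('annual report'));
console.log(buildSearch('annual report', { phrase: true }));
```

Keeping the page number in each row means a hit can point to the page inside a 2,000-page file rather than only to the file, and the search term is passed as a parameter instead of being concatenated into the SQL.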