I need to write a REST service that offers batch validation for PDF documents.

The basic validation workflow is as follows:

Files are uploaded via the REST API one file at a time. Documents are typically between 10 and 150 megabytes in size. Once a batch is complete, validation starts: each document is taken from storage in turn and validated, a report is generated, and the originally uploaded document is then deleted.
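To make the workflow concrete, here is a minimal sketch of the take, validate, report, delete loop. Everything in it is an assumption for illustration: the class `BatchWorkflowSketch`, the in-memory map standing in for the real storage, and the `looksLikePdf` check, which only tests for the `%PDF-` file header and is a placeholder for real validation.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch of the batch workflow; not tied to Jersey, EJB, or JPA.
class BatchWorkflowSketch {
    // In-memory stand-in for the real storage (database or file system).
    // LinkedHashMap keeps the upload order.
    private final Map<String, byte[]> storage = new LinkedHashMap<>();

    // One file per call, mirroring the one-file-at-a-time upload.
    void upload(String name, byte[] content) {
        storage.put(name, content);
    }

    // Takes each document in turn, validates it, records a report line,
    // and deletes the stored document afterwards.
    List<String> validateBatch(Predicate<byte[]> validator) {
        List<String> report = new ArrayList<>();
        Iterator<Map.Entry<String, byte[]>> it = storage.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, byte[]> entry = it.next();
            boolean ok = validator.test(entry.getValue());
            report.add(entry.getKey() + ": " + (ok ? "valid" : "invalid"));
            it.remove(); // delete the original once its report line exists
        }
        return report;
    }

    int storedCount() {
        return storage.size();
    }

    // Placeholder check: only tests for the "%PDF-" header bytes.
    static boolean looksLikePdf(byte[] doc) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        return doc.length >= magic.length
                && Arrays.equals(Arrays.copyOf(doc, magic.length), magic);
    }
}
```

In a real service the map would be replaced by the chosen storage, and documents of this size would be streamed rather than held as byte arrays.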

The development platform is Java EE (Jersey and EJB). Since the EJB programming restrictions don't allow beans to write files directly to disk, I've considered using a database via JPA to store the files temporarily until they are processed.

Is this a sound choice or would you prefer a different solution that I haven't thought of?

Is using a database for this scenario a bad idea performance-wise?

We're expecting batches of up to 4000 documents. I'm especially worried about performance bottlenecks and resource consumption (RAM, disk space).

Quercus
  • If you just want to store raw PDF files, I would simply use a regular file system. A DB would only make sense if you are parsing information out of the PDFs and storing that data in an orderly fashion. – Andy Guibert Nov 11 '17 at 20:42
  • Have you considered using AWS Lambda? – vikarjramun Nov 19 '17 at 02:09

1 Answer

I recommend using an SQL database that can handle BLOBs:

  1. I don't think there is a big difference in performance, since modern file systems use many of the same mechanisms as a DBMS, such as redo logs, transactions, and indexing.

  2. The big advantage is that you can use the Java EE mechanisms, including transactions, to make sure that exceptional situations are handled safely. You won't need special libraries for safe file saving.

  3. I suppose you will need a DBMS anyway, besides storing files. If you use only one persistent store, the problem of handling distributed transactions goes away (there aren't any).

  4. In my experience, if you have to work with IT admins, they may find it easier to provide multiples of 4000 × 150 MB (600 GB) of storage as an SQL database than as a file system.
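On the RAM concern: whichever storage is chosen, memory use stays flat if each upload is streamed through a small fixed buffer instead of being read into one large array. The following is a minimal sketch under assumptions: the class `StreamingCopySketch` and its `copy` method are hypothetical names, the source is any `InputStream` (a JAX-RS resource method can typically receive the request body as one), and the target is any `OutputStream`. With plain JDBC, `PreparedStatement.setBinaryStream` accepts an `InputStream` for a BLOB column in a similar spirit; whether a given driver or JPA provider actually streams rather than buffers is something to verify.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical helper: copies a stream using a fixed-size buffer, so memory
// use depends on bufferSize and not on the size of the document.
class StreamingCopySketch {
    static long copy(InputStream in, OutputStream out, int bufferSize) {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        try {
            int read;
            // read() returns -1 at end of stream
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
                total += read;
            }
        } catch (IOException e) {
            // Rethrown unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return total; // number of bytes copied
    }
}
```

With an 8 KB buffer, a 150 MB document needs roughly 8 KB of heap for the copy itself, so the size of a batch affects disk space rather than RAM.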

aschoerk