
I’m working on a feature where I need to convert a huge HTML file (more than 1 MB) into a PDF. I’ve tried the two open-source Python libraries below: 1. xhtml2pdf (Pisa) 2. WeasyPrint

But neither of them solves my problem: they take around 4–5 minutes to generate a 1 MB PDF file (around 500 pages), which causes my app server’s worker process (Gunicorn behind Nginx) to go down and the browser to show a ‘GATEWAY TIMEOUT’ error. CPU utilization also goes up to 100% while the PDF conversion is in progress.

Does anybody have an idea which API/library would be the best fit for large HTML files?

Sachin Chauhan

2 Answers


Generating a 500-page PDF will take time whatever technology you use, so the solution is to send the job to an async task queue (Celery, huey, django-queue, ...), possibly with some polling to show a progress bar. Even if you optimize the generation process as much as possible, it will STILL take too much time to fit in an HTTP request/response cycle (from the user's point of view, even one minute is already way too long).

NB: having your CPU max out is not surprising either - generating a huge PDF not only takes time, it's also a computation-heavy process, and one that easily eats your memory too. This by itself is another reason to use a distributed task queue, so you can run the process on a distinct node and avoid killing your front server.
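A minimal sketch of the task-queue approach described in this answer, assuming Celery with a Redis broker and WeasyPrint for the actual rendering; the module names, broker URLs and the generate_pdf task are illustrative, not something prescribed by the answer:

```python
# tasks.py - illustrative Celery task; broker/backend URLs and file paths
# are assumptions, adapt them to your project.
from celery import Celery
from weasyprint import HTML

app = Celery("pdf_jobs",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")

@app.task
def generate_pdf(html_string, output_path):
    """Render a (possibly huge) HTML string to a PDF file, outside the
    HTTP request/response cycle."""
    HTML(string=html_string).write_pdf(output_path)
    return output_path
```

The view would then only enqueue the job, e.g. `result = generate_pdf.delay(html_string, "/tmp/report.pdf")`, return immediately, and let the browser poll a status endpoint that checks the task's result/state until the file is ready.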

bruno desthuilliers
  • Thanks @Bruno. We can schedule an async task through Celery, but the problem is that wherever that task executes, it will eat up CPU and memory, so other processes on that server will also be impacted or, in the worst case, may stop working. Any comment on how to handle CPU utilization, please? – Sachin Chauhan Jun 29 '18 at 11:50
  • Limiting the CPU usage of a process has to be dealt with at the OS level, so you'll have to ask your sysadmin about that part. W.r.t. memory, the simplest solution is to make sure you run this task on a dedicated (or mostly dedicated) node with enough RAM to handle your biggest expected PDF - RAM is not that expensive nowadays. Oh, and yes, make sure your Celery workers have MAX_TASKS_PER_CHILD=1 so they release the RAM after each generation. – bruno desthuilliers Jun 29 '18 at 12:00
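A hedged sketch of the worker-side settings these comments are discussing, using Celery's lowercase configuration names as I know them (Celery 4+); the dedicated "pdf" queue and the tasks module name are assumptions:

```python
# celeryconfig.py - illustrative worker settings for a (mostly) dedicated PDF node.
worker_max_tasks_per_child = 1   # recycle the worker process after each PDF so RAM is released
worker_concurrency = 1           # only one heavy render at a time on this node
task_routes = {"tasks.generate_pdf": {"queue": "pdf"}}  # route PDF jobs to their own queue/node
```

OS-level CPU limits (e.g. cgroups or container CPU quotas) are separate from Celery and, as the comment says, something to sort out with your sysadmin.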

It's just a guess - I have never used it - but I found this answer: C++ Library to Convert HTML to PDF? And as far as I know there is Cython, which can be used to combine C/C++ and Python. That would probably speed things up.

Otherwise you would need to either break the document into small pieces and merge them, or do something with the timeout parameters in the classes responsible for the request handling - but that has to be done on both sides, server and client. I guess you would need to calculate the timeout dynamically depending on file size and the time needed, which doesn't sound like the best decision to me, but just in case...
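If you go the "break it into small pieces and merge them" route, a rough sketch could look like this; it assumes the HTML can be cut into self-contained chunks (split_html is a hypothetical helper you would have to write) and uses WeasyPrint plus pypdf for the merge, neither of which is prescribed by this answer:

```python
# Hedged sketch of the "split, render, merge" idea. split_html() is a
# hypothetical helper: how to cut the HTML into valid, self-contained chunks
# depends entirely on your document structure.
from weasyprint import HTML
from pypdf import PdfWriter

def html_to_pdf_in_chunks(html_string, output_path, split_html):
    writer = PdfWriter()
    for i, chunk in enumerate(split_html(html_string)):
        part_path = f"/tmp/part_{i}.pdf"          # illustrative temp location
        HTML(string=chunk).write_pdf(part_path)   # render one chunk at a time
        writer.append(part_path)                  # merge the chunk into the final document
    with open(output_path, "wb") as fh:
        writer.write(fh)
```

Note that rendering chunk by chunk loses anything that spans chunks (continuous page numbering, running headers), so whether this is acceptable depends on the document.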