I am trying to convert a corpus of .pdf documents into a corpus of .txt documents using pdfminer's pdf2txt.py tool. The process works well on most documents, but some of the PDFs take an exceptionally long time to convert, and a few never seem to finish at all; the process just gets stuck. I want to stop any conversion that exceeds a few minutes of processing time. I can create a timer function, but how do I get pdf2txt to abandon a document that is taking too long and move on to the next one?
I've included the code for my for loop below; it doesn't have any timer function yet.
import os
import subprocess as sp

documents = <list of .pdf filenames>
dir = '../data/'

for doc in documents:
    # Replace the .pdf extension with .txt for the output filename
    txt = dir + doc[0:-3] + 'txt'
    # Build the shell command, redirecting pdf2txt.py's output to the .txt file
    cmd = "pdf2txt.py " + dir + doc + " > " + txt
    sp.run(cmd, shell=True)
A large number of these documents are scans rather than text-based PDFs. pdf2txt is able to handle most of those, but for a few of them the code gets stuck on the shell command.
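For what it's worth, this is roughly what I had in mind for the timer, using the timeout parameter of subprocess.run (the 180-second limit is just an example). I'm not sure this is the right approach, though, since with shell=True the timeout may only kill the shell and leave a hung pdf2txt.py process running:

for doc in documents:
    txt = dir + doc[0:-3] + 'txt'
    cmd = "pdf2txt.py " + dir + doc + " > " + txt
    try:
        # Abort the conversion if it runs longer than 3 minutes
        sp.run(cmd, shell=True, timeout=180)
    except sp.TimeoutExpired:
        # Skip this document and move on to the next one
        print("Timed out on " + doc + ", skipping")
        continue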