I am trying to convert a corpus of .pdf documents into a corpus of .txt documents using pdfminer's pdf2txt.py tool. The process works well on most documents, but some of the PDFs take an exceptionally long time to convert, and a few never seem to finish at all; the process just gets stuck. I want to stop any conversion that exceeds a few minutes of processing time. I can create a timer function, but how do I get pdf2txt to abandon a document that is taking too long and move on to the next one?
I've included the code for my for loop below; it doesn't have any timer function yet.
import os
import subprocess as sp

documents = <list of .pdf filenames>
dir = '../data/'

for doc in documents:
    # Replace the .pdf extension with .txt for the output filename
    txt = dir + doc[0:-3] + 'txt'
    # Build the shell command, redirecting pdf2txt.py's output to the .txt file
    cmd = "pdf2txt.py " + dir + doc + " > " + txt
    sp.run(cmd, shell=True)
A large number of these documents are scans rather than text-based PDFs. pdf2txt is able to handle most of those, but for a few of them the code gets stuck on the shell command.
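For what it's worth, this is roughly what I had in mind for the timer, using the timeout parameter of subprocess.run (the 180-second limit is just an example). I'm not sure this is the right approach, though, since with shell=True the timeout may only kill the shell and leave a hung pdf2txt.py process running:

for doc in documents:
    txt = dir + doc[0:-3] + 'txt'
    cmd = "pdf2txt.py " + dir + doc + " > " + txt
    try:
        # Abort the conversion if it runs longer than 3 minutes
        sp.run(cmd, shell=True, timeout=180)
    except sp.TimeoutExpired:
        # Skip this document and move on to the next one
        print("Timed out on " + doc + ", skipping")
        continue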