I am trying to convert thousands of PDF files to HTML. I was able to convert this PDF file to this HTML file using the following code:
    import os

    def convertPDFToHtml():
        # Run pdfminer's pdf2txt.py to write test.pdf out as HTML
        command = 'pdf2txt.py -o output.html -t html test.pdf'
        os.system(command)
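Since I have thousands of files, I would eventually run the same command in a loop. Here is a rough sketch of what I have in mind, assuming `pdf2txt.py` is on the PATH and can be executed directly; `build_command`, `convert_all`, and the directory arguments are placeholder names of mine, not part of any library:

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, out_dir):
    """Return the pdf2txt.py argument list for one PDF (same flags as my command)."""
    # Output file keeps the PDF's base name, e.g. test.pdf -> test.html
    html_path = Path(out_dir) / (Path(pdf_path).stem + ".html")
    return ["pdf2txt.py", "-o", str(html_path), "-t", "html", str(pdf_path)]


def convert_all(pdf_dir, out_dir):
    """Convert every *.pdf in pdf_dir and return the paths that failed."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        # Assumption: pdf2txt.py is on PATH; a non-zero exit code means failure
        result = subprocess.run(build_command(pdf_path, out_dir))
        if result.returncode != 0:
            failed.append(pdf_path)
    return failed
```

Passing the arguments as a list instead of one string avoids quoting problems with file names that contain spaces.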
I want to parse the HTML file so that I can extract different pieces of text from it. The problem is that the output HTML file is missing a lot of the text from the original PDF.
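For context, this is roughly the kind of parsing I have in mind: a minimal sketch using only the standard library's `html.parser`. `TextExtractor` and `extract_text` are hypothetical names of mine, and it simply collects every non-empty text node in document order:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the non-empty text nodes of an HTML document in order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Skip whitespace-only nodes between tags
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html):
    """Return a list of the text fragments found in the given HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.chunks
```

It would be used as `extract_text(open('output.html', encoding='utf-8').read())`, assuming the converted file is UTF-8 encoded.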
Is there a better way to convert the PDF files and parse the resulting HTML text?