When I copy and paste a PDF document into a text file using Ctrl+A, Ctrl+C, Ctrl+V, I get a result like this:
But when I use pdfminer with the code below, I get this:
from cStringIO import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
*....*
def scrub(self):
    text = self.convert(self.inFile)
    with open(self.WBOutputFile, "w") as WBOut:
        WBOut.write(text)

# code from Tim Arnold at https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167
def convert(self, fname):
    pagenums = set()  # an empty set means every page is processed
    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    infile = open(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close()  # the parentheses were missing, so the buffer was never closed
    return text
*....*
The code takes several seconds longer than doing it manually, but I want to automate this PDF-to-text process because I have a lot of documents. Is there a way to get a result similar to copy and paste, in terms of both speed and formatting? I am using Chrome as my PDF viewer, Sublime Text as my text editor, and Windows 8 as my OS.
I am using the PDF from http://www.supremecourt.gov/oral_arguments/argument_transcripts/14-8349_n648.pdf
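To make the "lot of documents" part concrete, here is a minimal sketch of the batch step with per-file timing, so different extraction approaches can be compared on speed. It assumes Python 3, and it assumes `convert` is any callable that takes a PDF path and returns the text as a `str` (for example, a standalone version of the `convert` method above). The names `convert_all` and `out_dir` are hypothetical and not part of pdfminer.

```python
import io
import os
import time


def convert_all(pdf_paths, convert, out_dir):
    """Write one .txt file per PDF into out_dir and time each conversion.

    `convert` is a caller-supplied function: PDF path in, text (str) out.
    Returns a list of (pdf_path, txt_path, seconds) tuples.
    """
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)

    results = []
    for pdf_path in pdf_paths:
        # Time only the extraction itself, not the file write.
        start = time.perf_counter()
        text = convert(pdf_path)
        elapsed = time.perf_counter() - start

        # "some/dir/report.pdf" becomes "<out_dir>/report.txt".
        base = os.path.splitext(os.path.basename(pdf_path))[0]
        txt_path = os.path.join(out_dir, base + ".txt")
        with io.open(txt_path, "w", encoding="utf-8") as out:
            out.write(text)

        results.append((pdf_path, txt_path, elapsed))
    return results
```

Passing the extraction function in as a parameter keeps the loop independent of pdfminer, so the same timing harness can be reused with any other extractor.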