How to get the same result as copying and pasting a PDF to text using Python?

Question

When I copy and paste a PDF document into a text file using Ctrl+A, Ctrl+C, Ctrl+V, I get a result like this:

but when I use pdfminer with the code below I get this:

from cStringIO import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

*....*

def scrub(self):
    text = self.convert(self.inFile)
    with open(self.WBOutputFile, "w") as WBOut:
        WBOut.write(text)

# Code from Tim Arnold at https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167
def convert(self, fname):
    pagenums = set()  # an empty set means every page is processed

    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    # The with block closes the file even if a page fails to parse
    with open(fname, 'rb') as infile:
        for page in PDFPage.get_pages(infile, pagenums):
            interpreter.process_page(page)
    converter.close()
    text = output.getvalue()
    output.close()
    return text

*....*

The code takes several seconds longer than doing it manually, but I want to automate this PDF-to-text process because I have a lot of documents. Is there a way to get a result similar to copy and paste, in terms of both speed and formatting? I am using Chrome as my PDF viewer, Sublime Text as my text editor, and Windows 8 as my OS.

I am using the PDF from http://www.supremecourt.gov/oral_arguments/argument_transcripts/14-8349_n648.pdf

Possible duplicate of [How to extract text from a PDF file in Python?](http://stackoverflow.com/questions/15583535/how-to-extract-text-from-a-pdf-file-in-python) — Peter Wood, Nov 09 '15 at 23:34
I tried using pyPdf before, but I wasn't able to get simple formatting from it either. My question is more specifically about whether Python can use a method similar to Ctrl+A, Ctrl+C, Ctrl+V to copy text from a PDF. — kkawabat, Nov 09 '15 at 23:46
I think Chrome is doing something clever to allow you to select and copy sections of text. A PDF is really formatted for printing, and the semantic structure of the document is no longer important. You might have to do a lot of work to reconstitute lines, paragraphs, etc. — Peter Wood, Nov 10 '15 at 07:21
@kkawabat - great question, and it's hard to understand why it's so difficult to find a good solution - did you ever find anything? — elPastor, Oct 12 '18 at 21:09

score 2 · Answer 1 · answered Mar 29 '16 at 08:55

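Try setting char_margin in the LAParams to 50. char_margin controls how far apart two characters can be and still be grouped into the same line; the default is 2.0, so widely spaced text tends to get split into separate pieces.

For example: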

laparams = LAParams()
laparams.char_margin = 50.0
converter = TextConverter(manager, output, laparams=laparams)
interpreter = PDFPageInterpreter(manager, converter)
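
For readers on Python 3, here is a minimal sketch of the same idea using the high-level API of the pdfminer.six fork, assuming pdfminer.six is installed. The helper names txt_path_for, pdf_to_text and convert_folder are hypothetical and not part of the original code; the folder loop is only one possible way to handle a batch of documents.

```python
from pathlib import Path


def txt_path_for(pdf_path):
    """Return the output path: same folder and name, with a .txt suffix."""
    return Path(pdf_path).with_suffix(".txt")


def pdf_to_text(pdf_path, char_margin=50.0):
    """Extract the text of one PDF using a wide char_margin."""
    # Imported here so the path helper above loads even if pdfminer.six
    # is not installed (an assumption made for this sketch).
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    # A larger char_margin lets characters that sit far apart on the same
    # line still be grouped into a single text line.
    laparams = LAParams(char_margin=char_margin)
    return extract_text(str(pdf_path), laparams=laparams)


def convert_folder(folder, char_margin=50.0):
    """Convert every *.pdf directly inside folder; return the files written."""
    written = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        out_path = txt_path_for(pdf_path)
        out_path.write_text(pdf_to_text(pdf_path, char_margin), encoding="utf-8")
        written.append(out_path)
    return written
```

Calling convert_folder("transcripts") would write one .txt file next to each PDF in a folder named transcripts (a placeholder name). Whether 50 is the right margin depends on the document, so it may be worth comparing a few values against the copy-and-paste output.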

Dennis Chou
