I want to use pdfminer.six, a Python 3 tool for extracting information from PDF documents. The problem is that I can't find good documentation or any source code example showing how to use it.

I have already tried some code from StackOverflow but it didn't work. Below is my code.

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

I would like a code example showing how to use this tool to get data from PDFs.

Martin Thoma
Urvish

2 Answers

Install pdfminer.six or pdfminer3 (https://github.com/gwk/pdfminer3/). To install the latter, run: pip install pdfminer3. I switched to pdfminer3 when I upgraded from Python 3.6 to 3.7, and I use it on Ubuntu and macOS with Python 3.7.3.

pdfminer3 comes with two handy tools, pdf2txt.py and dumppdf.py. Examine their source; it is fairly small and easy to understand.

The following is a working example (once the path to the PDF file is filled in):

from pdfminer3.layout import LAParams, LTTextBox
from pdfminer3.pdfpage import PDFPage
from pdfminer3.pdfinterp import PDFResourceManager
from pdfminer3.pdfinterp import PDFPageInterpreter
from pdfminer3.converter import PDFPageAggregator
from pdfminer3.converter import TextConverter
import io

resource_manager = PDFResourceManager()
fake_file_handle = io.StringIO()
converter = TextConverter(resource_manager, fake_file_handle, laparams=LAParams())
page_interpreter = PDFPageInterpreter(resource_manager, converter)

with open('/path/to/file.pdf', 'rb') as fh:

    for page in PDFPage.get_pages(fh,
                                  caching=True,
                                  check_extractable=True):
        page_interpreter.process_page(page)

    text = fake_file_handle.getvalue()

# close open handles
converter.close()
fake_file_handle.close()

print(text)
pgcan
LaVar
  • I would like some more advice on this problem. Suppose I have PDF files that contain both tables and text, for example a paragraph or some details followed by a table describing them. The best example is an income tax return or a bank statement. If I use PDFminer, is it possible to get the data out in a meaningful way, such as which year the ITR covers and how much someone has paid? – Urvish Jun 11 '19 at 05:10
  • If the converter's `laparams` parameter is left unspecified, text can be extracted without spaces. I created a question and answered it (https://stackoverflow.com/questions/58889337/pdfminer3-extracts-text-from-pdf-without-spaces) in case anyone searches for this. – A.Ametov Nov 16 '19 at 09:36
  • How can I add a page delimiter after each page in the above code? – Palash Jhamb May 27 '20 at 21:46
  • Is there any way to convert the text into an array of decimals or integers? – Aleena Rehman Aug 19 '20 at 21:21
  • @Urvish use tabula-py – mounaim Nov 09 '20 at 15:07

Full disclosure: I am one of the maintainers of pdfminer.six. It is a community-maintained version of pdfminer for Python 3.

Nowadays, it has multiple APIs to extract text from a PDF, depending on your needs. Behind the scenes, all of these APIs use the same logic for parsing and analyzing the layout.

(All the examples assume your PDF file is called example.pdf)

Commandline

If you only need to extract text once, you can use the command-line tool pdf2txt.py:

$ pdf2txt.py example.pdf

High-level API

If you want to extract text, or properties of the text, with Python, you can use the high-level API. This approach is the go-to solution if you want to programmatically extract information from a PDF.

from pdfminer.high_level import extract_pages, extract_text

# Extract text from a pdf.
text = extract_text('example.pdf')

# Extract iterable of LTPage objects.
pages = extract_pages('example.pdf')
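As a rough sketch of what can be done with the iterable returned by extract_pages, the snippet below collects each text element on a page together with its bounding box. The helper names text_boxes and print_text_boxes are made up for this illustration, and the helper assumes that every layout element that carries text exposes get_text() and a bbox attribute, which is how the text containers in pdfminer.six behave as far as I know.

```python
from typing import Iterable, List, Tuple

BBox = Tuple[float, float, float, float]


def text_boxes(page_layout: Iterable) -> List[Tuple[BBox, str]]:
    """Return (bbox, text) for each element on a page that carries text.

    Hypothetical helper: it is duck-typed, so anything with get_text()
    and bbox is accepted. Lines, rectangles and images have no
    get_text() and are skipped.
    """
    boxes = []
    for element in page_layout:
        if hasattr(element, "get_text"):
            boxes.append((element.bbox, element.get_text().strip()))
    return boxes


def print_text_boxes(pdf_path: str) -> None:
    """Print every text box of a PDF with its lower-left corner."""
    # Imported here so text_boxes() stays usable without pdfminer.six.
    from pdfminer.high_level import extract_pages

    for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
        for (x0, y0, _x1, _y1), text in text_boxes(page_layout):
            print(f"page {page_number} at ({x0:.0f}, {y0:.0f}): {text!r}")
```

Calling print_text_boxes('example.pdf') should list the text boxes page by page. The coordinates can be a starting point for grouping text that sits in the same row or column, although that grouping logic is not part of this sketch.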

Composable API

There is also a composable API that gives a lot of flexibility in handling the resulting objects. For example, it allows you to plug in your own layout algorithm. This method is suggested in the other answers, but I would only recommend it when you need to customize some component.

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open('example.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())
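One thing the composable API makes easy is handling pages one at a time, for instance to put your own delimiter between them. The following is only a sketch under that assumption: drain and iter_page_texts are hypothetical names, not part of pdfminer.six, and the approach simply empties the output buffer after each processed page.

```python
from io import StringIO
from typing import Iterator


def drain(buffer: StringIO) -> str:
    """Return everything written to the buffer so far, then empty it."""
    text = buffer.getvalue()
    buffer.seek(0)
    buffer.truncate(0)
    return text


def iter_page_texts(pdf_path: str) -> Iterator[str]:
    """Yield the text of each page separately (hypothetical helper)."""
    # Imported here so drain() stays usable without pdfminer.six.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    buffer = StringIO()
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, buffer, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    try:
        with open(pdf_path, "rb") as in_file:
            for page in PDFPage.get_pages(in_file):
                interpreter.process_page(page)
                # The buffer now holds only this page's text.
                yield drain(buffer)
    finally:
        device.close()
```

With this, something like "\n=== page break ===\n".join(iter_page_texts('example.pdf')) would give one string with a visible delimiter. As far as I know, TextConverter already writes a form feed character after each page, so each chunk may end with "\f"; you may want to strip it.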

Similar question and answers here. I'll try to keep them in sync.

Pieter
  • I'm getting "None" as responses from the extract_text() method, one for each page, when I try and print it to the console. Why might that be happening? I tried with 2 different PDFs, both academic studies from different sources. – Tyler Cheek Dec 02 '21 at 00:06
  • Okay I figured that part out: I had used the Print function in Firefox to Save As PDF, instead of directly downloading it. Doing so makes the whole page just an image instead of a collection of glyphs. – Tyler Cheek Dec 02 '21 at 00:15