114

I am looking for documentation or examples on how to extract text from a PDF file using PDFMiner with Python.

It looks like PDFMiner updated its API, and all the relevant examples I have found contain outdated code (classes and methods have changed). The libraries I have found that make extracting text from a PDF file easier also use the old PDFMiner syntax, so I'm not sure how to do this.

As it is, I'm just looking at source-code to see if I can figure it out.

Cornelius Roemer
RattleyCooper
  • 2
    Please check out http://stackoverflow.com/help/how-to-ask and http://stackoverflow.com/help/mcve and update your answer so it is in a better format and aligns to the guidelines. – Parker Oct 21 '14 at 19:03
  • Which distribution of Python are you using, 2.7.x or 3.x.x? It should be noted that the author *explicitly* detailed that `PDFminer` doesn't work with Python 3.x.x. That might be the reason you're getting `import` errors. You should use `pdfminer3k` if so, as it is the standing Python 3 import of said library. – WGS Oct 21 '14 at 19:13
  • @Nanashi, sorry, I forgot to add my Python version. It's 2.7 so that isn't the issue. I have been looking through the source-code and it looks like they restructured some things which is why the imports are breaking. I can't find any documentation for PDFMiner either or I would just be working off of that :( – RattleyCooper Oct 21 '14 at 19:14
  • I have just literally installed `PDFminer` off from GitHub and it imports fine. Can you kindly post your code and post your full error traceback as well? – WGS Oct 21 '14 at 19:18
  • @Nanashi, Like I said in my original question, the libraries that rely on PDFMiner break before finishing imports along with any example that I can find. This is not a PDFMiner issue. This is me looking for documentation, or an example of how to use PDFMiner. Everything I can find is using an old syntax for PDFMiner. I went ahead and edited my question for clarity. I think I made it more confusing than it needed to be. Sorry about that. – RattleyCooper Oct 21 '14 at 19:19
  • If that's the case, you're in for a downer: the docs are *very* sparse. The offline docs coming in with the GitHub download didn't even break 100KB. In addition, the Google user group is not active, I believe. If you're willing to brave the rather insufficient docs, here's the relevant [link](http://www.unixuser.org/~euske/python/pdfminer/programming.html). A recommended example is [here](http://denis.papathanasiou.org/2010/08/04/extracting-text-images-from-pdf-files/) as well. – WGS Oct 21 '14 at 19:24
  • Admittedly, even the recommended example is outdated: it's from Jan 2012. The API, as you said, was updated March this year. If all else fails, it looks like you will have to port some of the functions yourself. I don't think it will be *that* difficult, but if entire class structures and methods were deprecated or changed, therein lies the problem. However, based on changelogs, it seems like the only real big change happened when the package was updated to accommodate 2.6 as the minimum, up from 2.4. Unless that's the level of update you need, I think it's pretty easy to port it. – WGS Oct 21 '14 at 19:26
  • @Nanashi, I'm trying to just point the imports in the examples to the right location to see if the methods retained their functionality. Hopefully it will work! – RattleyCooper Oct 21 '14 at 19:27
  • Possible duplicate of [How do I use pdfminer as a library](http://stackoverflow.com/questions/5725278/how-do-i-use-pdfminer-as-a-library) – tripleee Mar 02 '16 at 07:17

6 Answers

213

Here is a working example of extracting text from a PDF file using the current version of PDFMiner (September 2016):

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    # codec is left at its default; some pdfminer.six releases reject the keyword
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0        # 0 means no page limit
    caching = True
    pagenos = set()     # an empty set means all pages

    # The context manager closes the file even if parsing raises
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                      caching=caching, check_extractable=True):
            interpreter.process_page(page)

    text = retstr.getvalue()

    device.close()
    retstr.close()
    return text

PDFMiner's module structure changed recently; the imports above follow the new layout, so this should work for extracting text from PDF files.

Edit: Still working as of June 7, 2018. Verified in Python 3.x.

Edit: The solution works with Python 3.7 as of October 3, 2019. I used the Python library pdfminer.six, released in November 2018.
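
As a rough sketch of how the function might be called for a whole folder: the helper below takes any function that maps a file path to text (for example `convert_pdf_to_txt` from above) and applies it to every `.pdf` file in a directory. The name `extract_folder` and the folder layout are hypothetical, not part of PDFMiner.

```python
from pathlib import Path
from typing import Callable, Dict


def extract_folder(folder: str, extract: Callable[[str], str]) -> Dict[str, str]:
    """Run `extract` on every .pdf file in `folder`, keyed by file name."""
    results = {}
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        # `extract` receives the path as a string, like convert_pdf_to_txt(path)
        results[pdf_path.name] = extract(str(pdf_path))
    return results
```

Usage would then be something like `texts = extract_folder("reports", convert_pdf_to_txt)`, where `"reports"` is whatever directory holds your PDFs.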

Trenton McKinney
RattleyCooper
  • 3
    works fine, but, how can I deal with spaces in for example names? suppose I have a pdf that contains 4 columns where I have first- and lastname in one col, now it get parsed with firstname in one row and lastname in one row, here's an example http://docdro.id/rRyef3x – Deusdeorum Mar 19 '16 at 11:45
  • 3
    Currently getting an import error with this code: ImportError: No module named 'pdfminer.pdfpage' – Jeffrey Swan Oct 22 '16 at 14:43
  • @Jefe, make sure you run make install after downloading pdfminer. – Francois Jan 19 '17 at 14:09
  • works on ubuntu 16.04 as of February 1, 2017 - big thanks, finally sth that actually works :) – fanny Feb 01 '17 at 00:12
  • Thanks. Python 3 users might want to change `from cStringIO import StringIO` to `from io import StringIO`. See http://stackoverflow.com/a/18284900/701284 – Tsan-Kuang Lee May 11 '17 at 22:46
  • 2
    Thanks it works on python v2.7.12 and on ubuntu 16.04, though it would be better to load the pdf document with encoding utf-8, because my sample pdf has some encoding issue so try this after encoding with utf-8 and it resolve the issue... `import sys reload(sys) sys.setdefaultencoding('utf-8')` – sib10 May 29 '17 at 11:15
  • 2
    July, 19, 2017, still working on Python v2.7.12. Also, it's good to note that if you want HTML instead of plain text, you can switch `TextConverter` to `HTMLConverter`. – Qrom Jul 19 '17 at 13:16
  • 3
    @DuckPuncher, is it still working now? I had to change the `file(path, 'rb')` to `open(path, 'rb')` to get mine to work. – craned Oct 17 '17 at 15:04
  • Hi..When i'm trying to get data from pdf..using above code..i'm getting error saying. TypeError: unicode argument expected, got 'str' at interpreter.process_page(page) Please help..if you are aware..thanks – Niks Jain Nov 21 '17 at 14:11
  • 3
    Still working for Python3.7 users. Installed pdfminer.six==20181108 package. Best solution so far for my case and I compared numerous solutions. – aze45sq6d Nov 05 '19 at 09:02
  • Tried with `pdfminer.six==20181108` and `pdfminer.six==20200124` ; in both cases nothing is extracted except `\x0c`, although the doc is a simple dummy doc generated with word. Working with `Python 3.7`. – Vincent Mar 18 '20 at 12:09
  • How can I get it to work? Do I need to replace the path in `def convert_pdf_to_txt(path):` with my directory path? – Kijimu7 Mar 24 '20 at 16:30
  • 2
    Check out below answer which works in May 2020 and is very simple: `from pdfminer.high_level import extract_text` then `text = extract_text('report.pdf')` https://stackoverflow.com/a/61857301/7483211 – Cornelius Roemer May 17 '20 at 19:09
  • 2
    I got a "got an unexpected keyword argument 'codec'" error. It seems like the 'codec' parameter has since been removed. It worked when I removed the codec parameter from TextConverter(...) – Alex Morgan Nov 15 '20 at 20:06
  • I used same code on windows python 3.7 . Text extracted but all newlines are lost! Any idea? – Sandeep Bhutani Dec 13 '20 at 17:54
33

This works in May 2020 using pdfminer.six in Python 3.

Installing the package

$ pip install pdfminer.six

Importing the package

from pdfminer.high_level import extract_text

Using a PDF saved on disk

text = extract_text('report.pdf')

Or alternatively:

with open('report.pdf','rb') as f:
    text = extract_text(f)

Using PDF already in memory

If the PDF is already in memory, for example if retrieved from the web with the requests library, it can be converted to a stream using the io library:

import io

import requests

# url is a placeholder for the address of the PDF you want to fetch
response = requests.get(url)
text = extract_text(io.BytesIO(response.content))
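
A download can silently return an HTML error page instead of a PDF, so it may be worth checking the bytes before handing them to `extract_text`. The sketch below is one lenient way to do that; the helper name `pdf_stream_from_bytes` and the 1024-byte window are assumptions for illustration, not part of pdfminer.six.

```python
import io


def pdf_stream_from_bytes(data: bytes) -> io.BytesIO:
    """Wrap raw bytes in a binary stream after a basic PDF sanity check."""
    # PDF files normally carry the marker "%PDF-" at (or very near) the start;
    # an HTML error page will not contain it there.
    if b"%PDF-" not in data[:1024]:
        raise ValueError("data does not look like a PDF (no %PDF- marker found)")
    return io.BytesIO(data)
```

With this in place, the call above could become `text = extract_text(pdf_stream_from_bytes(response.content))`.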

Performance and Reliability compared with PyPDF2

pdfminer.six works more reliably than PyPDF2 (which fails with certain types of PDFs, in particular PDF version 1.7).

However, text extraction with pdfminer.six is significantly slower than with PyPDF2, by roughly a factor of 6.

I timed text extraction with timeit on a 15" MBP (2018), timing only the extraction function (no file opening etc.) with a 10 page PDF and got the following results:

PDFminer.six: 2.88 sec
PyPDF2:       0.45 sec
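
For anyone who wants to repeat this kind of measurement, here is a minimal sketch using `timeit`. It is not the exact script behind the numbers above; `time_extraction` is a hypothetical helper, and it takes the best of a few single runs to reduce noise.

```python
import timeit
from typing import Callable


def time_extraction(extract: Callable[[], object], repeat: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeat` single runs."""
    # number=1 runs the callable once per measurement; min() picks the fastest run
    return min(timeit.repeat(extract, number=1, repeat=repeat))
```

To keep file opening out of the measurement, read the PDF into memory first and pass something like `lambda: extract_text(io.BytesIO(pdf_bytes))`, where `pdf_bytes` holds the file contents.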

pdfminer.six also has a large footprint: it requires pycryptodome, which in turn needs GCC and other build tools. On Alpine Linux, this pushed a minimal Docker image from 80 MB to 350 MB. PyPDF2 has no noticeable storage impact.

Update (2022-08-04): According to Martin Thoma, PyPDF2 has improved a lot in the past 2 years, so do give it a try as well. Here's his benchmark

Cornelius Roemer
  • 2
    PyPDF2 had a lot of improvements since this answer was given. Especially the text extraction was improved a lot. In [my benchmark](https://github.com/py-pdf/benchmarks) the text extraction of PyPDF2 is now better than the one of pdfminer – Martin Thoma Aug 08 '22 at 09:56
32

Terrific answer from DuckPuncher. For Python 3, make sure you install pdfminer2 and do:

import io

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    laparams = LAParams()
    # codec is left at its default; some pdfminer.six releases reject the keyword
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()

    # The context manager closes the file even if parsing raises
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                      password=password,
                                      caching=caching,
                                      check_extractable=True):
            interpreter.process_page(page)

    device.close()
    text = retstr.getvalue()
    retstr.close()
    return text
manish Prasad
juan Isaza
  • 2
    It doesn't work for me: ModuleNotFoundError: No module named 'pdfminer.pdfpage' i am using python 3.6 – Atti Jul 17 '17 at 15:45
  • @Atti, just in case, make sure that you have pdfminer2 installed, as there is another package pdfminer (I hate this). It works for pdfminer2==20151206 version when doing pip3 freeze. – juan Isaza Jul 19 '17 at 03:26
  • 7
    thanks i got it working eventually, i installed pdfminer.six from conda forge – Atti Jul 19 '17 at 07:44
  • 10
    For Python 3, pdfminer.six is the recommended package - https://github.com/pdfminer/pdfminer.six – Mike Driscoll Apr 12 '18 at 15:54
  • Is this still current? I'm getting the same `ImportError:` message –  May 09 '18 at 04:03
  • @Punter345: its current. Make sure you install "pdfminer2" not "pdfminer" – juan Isaza May 09 '18 at 18:12
  • try installing pdfminer3. this works for me under python 3.6 and pdfminer3 version 2018.12.3.0 – Stavros Afxentis Jul 17 '19 at 20:18
  • Error from python 3.7 - ModuleNotFoundError: No module named 'chardet'. Installing chardet using pip install chardet fixed this error. – Avnish Tiwary Sep 26 '19 at 13:42
  • Check out below answer which works in May 2020 and is very simple: `from pdfminer.high_level import extract_text` then `text = extract_text('report.pdf')` https://stackoverflow.com/a/61857301/7483211 – Cornelius Roemer May 17 '20 at 19:10
29

Full disclosure: I am one of the maintainers of pdfminer.six. It is a community-maintained version of pdfminer for Python 3.

Nowadays, it has multiple APIs to extract text from a PDF, depending on your needs. Behind the scenes, all of these APIs use the same logic for parsing and analyzing the layout.

(All the examples assume your PDF file is called example.pdf)

Command line

If you want to extract text just once, you can use the command-line tool pdf2txt.py:

$ pdf2txt.py example.pdf

High-level API

If you want to extract text (properties) with Python, you can use the high-level API. This approach is the go-to solution if you want to programmatically extract information from a PDF.

from pdfminer.high_level import extract_pages, extract_text

# Extract text from a pdf.
text = extract_text('example.pdf')

# Extract iterable of LTPage objects.
pages = extract_pages('example.pdf')
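
The LTPage objects can be iterated to reach the layout elements on each page. As a rough sketch, the helper below gathers the text of every element per page. It uses duck typing (anything with a `get_text()` method) rather than an `isinstance` check against pdfminer's layout classes, which is an assumption made so the sketch runs without pdfminer installed; `text_per_page` is a hypothetical name.

```python
from typing import Iterable, List


def text_per_page(pages: Iterable[Iterable[object]]) -> List[str]:
    """Join the text of every element on each page that exposes get_text()."""
    result = []
    for page_layout in pages:
        parts = []
        for element in page_layout:
            # Text boxes provide get_text(); lines, rectangles and figures do not
            get_text = getattr(element, "get_text", None)
            if callable(get_text):
                parts.append(get_text())
        result.append("".join(parts))
    return result
```

It would be called as `text_per_page(extract_pages('example.pdf'))`, giving one string per page.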

Composable API

There is also a composable API that gives a lot of flexibility in handling the resulting objects. For example, it allows you to create your own layout algorithm. This method is suggested in the other answers, but I would only recommend it when you need to customize some component.

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open('example.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())
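
The text produced by `TextConverter` ends each page with a form feed character (`\x0c`, which also shows up in some of the outputs quoted in the comments on this page). If you need the pages separately, a small helper along these lines can split the string; `split_pages` is a hypothetical name, not a pdfminer.six function.

```python
from typing import List


def split_pages(text: str) -> List[str]:
    """Split extracted text into per-page strings at form feed characters."""
    pages = text.split("\f")
    # A trailing form feed leaves one empty string at the end; drop it
    if pages and pages[-1] == "":
        pages.pop()
    return pages
```

For example, `split_pages(output_string.getvalue())` would return a list with one entry per page, where an empty entry indicates a page without extractable text.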

Similar question and answer here. I'll try to keep them in sync.

Pieter
2

This code was tested with pdfminer for Python 3 (pdfminer-20191125):

from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.layout import LTTextBoxHorizontal

def parsedocument(document):
    # convert all horizontal text into a lines list (one entry per line)
    # document is a file stream
    lines = []
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(document):
        interpreter.process_page(page)
        layout = device.get_result()
        for element in layout:
            if isinstance(element, LTTextBoxHorizontal):
                lines.extend(element.get_text().splitlines())
    return lines
  • I have PDF files which I am able to convert using the Nitro Pro tool. When I try to convert the same PDF using the code posted here, however, I get output which suggests that there is a permissions error. Here is the output: ('from the SAGE Social Science Collections. All Rights Reserved.\n\n\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c') – b00kgrrl Feb 22 '20 at 18:26
  • What do you mean a file stream? – Vincent Mar 18 '20 at 13:45
  • @Vincent with open(file,'rb') as stream: [...] – Rodrigo Formighieri Apr 17 '20 at 21:24
  • do you manage to get this file as a table/pandas ideally? https://www.groupe-psa.com/en/publication/monthly-world-sales-march-2020/ – Je Je May 02 '20 at 00:57
1

I realize that this is an old question. Anyone trying to use pdfminer should switch to pdfminer.six, which is the currently maintained version.
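
Since several similarly named packages come up on this page (pdfminer, pdfminer2, pdfminer3, pdfminer3k, pdfminer.six), it can help to check which ones are actually installed in your environment. The sketch below uses the standard library's `importlib.metadata` (Python 3.8+); the helper name and the candidate list are illustrative only.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional

# Distribution names mentioned in this thread
CANDIDATES = ("pdfminer.six", "pdfminer", "pdfminer2", "pdfminer3", "pdfminer3k")


def installed_pdfminer_variants(
    names: Iterable[str] = CANDIDATES,
) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found
```

Printing `installed_pdfminer_variants()` shows at a glance whether more than one variant is installed, which is a common cause of the import errors reported in the comments.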

julie