45

I am using Python 3.4 and need to extract all the text from a PDF and then use it for text processing.

All the answers I have seen suggest options for Python 2.7.

I need something in Python 3.4.

Bonson

  • 3
    Not sure why the downvote. As I mentioned, I checked all the available answers and also searched on Google. The only one I found that can be used with Python 3.4 was in this [xPDF detail](http://stackoverflow.com/questions/18320932/looking-for-recommendation-on-how-to-convert-pdf-into-structured-format?lq=1); all the others are for version 2.7. I have found nothing for version 3.4 of Python. Please also leave a comment when downvoting. – Bonson Sep 19 '15 at 14:16
  • 2
    This is a good yet blatantly off-topic question. Use [SoftwareRecs](https://softwarerecs.stackexchange.com/) for library recommendations. – Nino Filiu Jun 28 '19 at 19:13
  • You can try this solution; it works well in Python 3: [Link](https://stackoverflow.com/a/54936587/7521283) – Akshay Kumbhar Dec 20 '19 at 09:23
  • [pdfplumber](https://github.com/jsvine/pdfplumber) is the best option. [[Reference](https://stackoverflow.com/a/66785646/8321339)] – Vishal Gupta Mar 24 '21 at 16:50

5 Answers

52

You need to install the pypdf package to be able to work with PDFs in Python. pypdf can extract text/images. The text is returned as a Python string. To install it, run pip install pypdf from the command line. This module name is case-sensitive so make sure to type all lowercase.

from pypdf import PdfReader

reader = PdfReader('my_file.pdf')
print(len(reader.pages))  # number of pages, e.g. 56
page = reader.pages[9]    # index 9 is the tenth page (pages are zero-indexed)
page.extract_text()

The last statement returns all the text that is available on the tenth page (index 9) of the 'my_file.pdf' document.
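Since the question asks for all the text rather than a single page, the per-page call can be applied to every page and the results joined. The following is only a minimal sketch: `join_page_texts` and `pdf_to_text` are hypothetical helper names, and the blank-line separator is an assumption, not something pypdf prescribes.

```python
def join_page_texts(pages, separator="\n\n"):
    """Concatenate the text of every page-like object in `pages`.

    Each item only needs an extract_text() method, as pypdf pages have.
    """
    # Fall back to "" so a page with no extractable text does not break the join.
    return separator.join((page.extract_text() or "") for page in pages)


def pdf_to_text(path):
    """Hypothetical convenience wrapper: read `path` and return all of its text."""
    # Imported here so join_page_texts() stays usable without pypdf installed.
    from pypdf import PdfReader

    return join_page_texts(PdfReader(path).pages)
```

Calling `pdf_to_text('my_file.pdf')` would then give one string for the whole document, ready for further text processing.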

Martin Thoma
JohnnyBravo-xyz
  • Hi Ritesh, by any chance do you know the answer to this question? [Question](http://stackoverflow.com/questions/32773517/python-based-pdf-mining-and-table-text-processing) – Bonson Sep 25 '15 at 03:07
  • 4
    Minor correction: I think there should be quotation marks around "rb" in the open command on line two, rather than just rb. – kyrenia Aug 01 '16 at 19:31
  • 4
    Furthermore, the pages in pypdf2 are zero-indexed, i.e. `getPage(9)` will get you page #10. Page numbers in the original document are completely ignored by pypdf2. – nostradamus Oct 28 '16 at 07:42
  • One problem with pypdf2 is that in some cases it ignores new line characters which is not really a good thing! – Pedram Jul 13 '17 at 00:05
  • When importing tables with pypdf2, unfortunately there are no separators between cells. Numbers in adjacent cells get placed together, and there is no way to programmatically recognize what part of the number belongs to which cell. Has anybody figured out a solution for that? – Wael Hussein Aug 18 '18 at 16:15
  • 1
    CAUTION: a) It is not supported in Py3 and b) it ignores an entire word if the word contains an unparsable Unicode character (e.g. "), see https://github.com/mstamy2/PyPDF2/issues/37, and it is unpredictable, as commented by others above. It is a good tool, but sadly not for production code :( – user2390183 Dec 11 '18 at 10:27
  • 1
    2 years and they haven't fixed this bug https://github.com/mstamy2/PyPDF2/issues/254 I'd prefer to find a package that is properly supported. This one can't handle python 3. – Jeff Winchell Apr 19 '19 at 23:41
  • After looking everywhere, this worked for me! Thank you – anish Dec 06 '19 at 17:53
  • PyPDF2 has improved a lot in 2022. Most comments above are no longer valid. – Martin Thoma Dec 20 '22 at 18:12
7

pdfminer.six ( https://github.com/pdfminer/pdfminer.six ) has also been recommended elsewhere and is intended to support Python 3. I can't personally vouch for it though, since it failed during installation on macOS. (There's an open issue for that and it seems to be a recent problem, so there might be a quick fix.)
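For readers who do get it installed, recent pdfminer.six releases expose a high-level `extract_text` function in `pdfminer.high_level`. The sketch below wraps it; `extract_pdf_text` and its `extractor` parameter are hypothetical names added here for illustration (the parameter lets another extraction function be swapped in), and none of this was tested by the answer's author.

```python
def extract_pdf_text(path, extractor=None):
    """Return the text of the PDF at `path`.

    `extractor` is a hypothetical hook: any callable that takes a path and
    returns a string. By default pdfminer.six's high-level API is used.
    """
    if extractor is None:
        # Requires `pip install pdfminer.six`; imported lazily so the
        # function can be defined even where the package is missing.
        from pdfminer.high_level import extract_text as extractor
    return extractor(path)
```

With the package available, `extract_pdf_text('my_file.pdf')` should return the document text as a single string.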

Sarah Messer
3

Complementing @Sarah's answer: PDFMiner is a good choice. I have been using it for quite some time, and so far it has worked well for extracting the text content from a PDF. I wrote a function that calls pdfminer's CLI client and stores its output in a variable, which I can then use elsewhere. I am using Python 3.6 and the function does the required job, so it may work for you too:

import subprocess

def pdf_to_text(filepath):
    print('Getting text content for {}...'.format(filepath))
    # pdf2txt.py is the command-line script installed with pdfminer.
    # stderr is redirected into stdout, so both streams arrive in `output`.
    process = subprocess.Popen(['pdf2txt.py', filepath], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    output, _ = process.communicate()

    # With stderr merged into stdout, the return code is the only failure signal.
    if process.returncode != 0:
        raise OSError('Executing the command for {} caused an error:\nCode: {}\nOutput: {}'.format(filepath, process.returncode, output))

    return output.decode('utf-8')

The function relies on the standard-library subprocess module, hence the import at the top.
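On Python 3.5 or newer, the same idea can be written with `subprocess.run`, keeping stdout and stderr separate so the error output can be reported on its own. This is only a sketch: `run_and_capture` is a hypothetical helper name, and it assumes the `pdf2txt.py` script is on the PATH as in the function above.

```python
import subprocess


def run_and_capture(command):
    """Run `command` (a list of arguments) and return its decoded stdout."""
    result = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if result.returncode != 0:
        # Report stderr separately; undecodable bytes are replaced, not fatal.
        raise OSError('Command {} failed with code {}:\n{}'.format(
            command, result.returncode,
            result.stderr.decode('utf-8', errors='replace')))
    return result.stdout.decode('utf-8')


def pdf_to_text(filepath):
    # Assumes pdfminer's pdf2txt.py script is on the PATH.
    return run_and_capture(['pdf2txt.py', filepath])
```

Splitting the generic runner from the PDF-specific wrapper makes it possible to try the error handling with any command before pointing it at real PDFs.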

AnhellO
1

slate3k is good for extracting text. I've tested it with a few PDF files using Python 3.7.3, and it's a lot more accurate than PyPDF2, for instance. It's a fork of slate, which is a wrapper for PDFMiner. Here's the code I am using:

import slate3k as slate

with open('Sample.pdf', 'rb') as f:
    doc = slate.PDF(f)

# doc behaves like a list of strings,
# where each element is one page of the document
print(doc)

# the first page of the document
print(doc[0])

Credit to this comment on GitHub: https://github.com/mstamy2/PyPDF2/issues/437#issuecomment-400491342
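Because the result is a list with one string per page, a small helper can merge it into a single string for text processing. This is a sketch under assumptions: `pages_to_text` is a hypothetical name, and stripping surrounding whitespace (which also drops any trailing form-feed characters) is a choice made here, not something slate3k requires.

```python
def pages_to_text(pages, separator="\n\n"):
    """Join a list of per-page strings into a single string."""
    # str.strip() removes surrounding whitespace, including form feeds ("\x0c").
    cleaned = (page.strip() for page in pages)
    # Skip pages that are empty after stripping.
    return separator.join(page for page in cleaned if page)
```

With the code above, `pages_to_text(doc)` would give the whole document as one string.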

Kristen
0
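from pdfreader import SimplePDFViewer

# Open the file in binary mode and keep it open while the viewer is in use
with open('/tmp/Test-test-test.pdf', 'rb') as pdf_file:
    viewer = SimplePDFViewer(pdf_file)
    viewer.navigate(1)  # select page 1
    viewer.render()     # parse the content of the selected page
    strings = viewer.canvas.strings  # text strings found on that page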
Larytet