347

I'm trying to extract the text included in this PDF file using Python.

I'm using the PyPDF2 package (version 1.27.2), and have the following script:

import PyPDF2

with open("sample.pdf", "rb") as pdf_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    page = read_pdf.pages[0]
    page_content = page.extractText()
print(page_content)

When I run the code, I get the following output which is different from that included in the PDF document:

 ! " # $ % # $ % &% $ &' ( ) * % + , - % . / 0 1 ' * 2 3% 4
5
 ' % 1 $ # 2 6 % 3/ % 7 / ) ) / 8 % &) / 2 6 % 8 # 3" % 3" * % 31 3/ 9 # &)
%

How can I extract the text as is in the PDF document?

rapto
Simplicity
  • 8
    Copy the text using a good PDF viewer - Adobe's canonical Acrobat Reader, if possible. Do you get the same result? The difference is not that the *text* is different, but the *font* is - the character codes map to other values. Not all PDFs contain the correct data to restore this. – Jongware Jan 17 '16 at 11:51
  • I tried another document and it worked. Yes, it seems the issue is with the PDF itself – Simplicity Jan 17 '16 at 13:11
  • 7
    That PDF contains a character CMap table, so the restrictions and work-arounds discussed in this thread are relevant - http://stackoverflow.com/questions/4203414/pypdf-unable-to-extract-text-from-some-pages-in-my-pdf. – dwarring Jan 17 '16 at 21:34
  • 3
    The PDF indeed contains a correct CMAP so it is trivial to convert the ad hoc character mapping to plain text. However, it takes additional processing to retrieve the correct *order* of text. Mac OS X's Quartz PDF renderer is a nasty piece of work! In its original rendering order I get "m T’h iuss iisn ga tosam fopllloew DalFo dnogc wumithe ntht eI tutorial"... Only after sorting by x coordinates I get a far more likely correct result: "This is a sample PDF document I’m using to follow along with the tutorial". – Jongware Jan 25 '16 at 20:15
  • https://stackoverflow.com/questions/32667398/best-tool-for-text-extraction-from-pdf-in-python-3-4 – Rowf Abd Feb 09 '19 at 03:22
  • Pandas users (in particular) interested in table extraction must check bottom answers (Tabula and Camelot). – Skippy le Grand Gourou Feb 01 '21 at 17:07
  • 1
    PyPDF2 adds random whitespaces between/in words. very hard to process. – YuMei Jun 17 '22 at 11:12
  • PyPDF2 recently got way better text extraction! Give it a second try :-) – Martin Thoma Jul 02 '22 at 09:57
  • Still getting random whitespaces between words... Using PyPDF version 2.11.1 – Faraz Masroor May 17 '23 at 13:15
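
To illustrate the comment above about rendering order: if a tool hands back each text fragment together with its position, sorting by line and then by x coordinate can restore the reading order. A minimal sketch with made-up coordinates; the `(x, y, text)` tuples, the `line_tolerance` parameter and the function name are assumptions for illustration, not the output format of any particular library.

```python
def sort_reading_order(fragments, line_tolerance=2.0):
    """Order (x, y, text) fragments top-to-bottom, then left-to-right.

    Assumes y grows downwards; fragments whose y values differ by at most
    `line_tolerance` from the first fragment of a line share that line.
    """
    lines = []  # each entry: [reference_y, [fragments on that line]]
    for frag in sorted(fragments, key=lambda f: f[1]):
        if lines and abs(frag[1] - lines[-1][0]) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag[1], [frag]])
    # Within each line, sort by x and join the pieces with spaces
    return "\n".join(
        " ".join(text for _, _, text in sorted(line, key=lambda f: f[0]))
        for _, line in lines
    )


# Hypothetical fragments, stored in a scrambled rendering order
fragments = [
    (60.0, 10.0, "is"), (10.0, 10.0, "This"), (80.0, 10.5, "a"),
    (50.0, 30.0, "document"), (10.0, 30.0, "sample"),
]
print(sort_reading_order(fragments))
```

Real PDFs often use a coordinate system where y grows upwards, in which case the line sort would have to be reversed.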

34 Answers

304

I was looking for a simple solution to use with Python 3.x on Windows. There doesn't seem to be support from textract, which is unfortunate, but if you are looking for a simple solution for Windows/Python 3, check out the tika package; it is really straightforward for reading PDFs.

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

from tika import parser # pip install tika

raw = parser.from_file('sample.pdf')
print(raw['content'])

Note that Tika is written in Java so you will need a Java runtime installed.
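
Depending on the file, `raw['content']` may come back as `None` (for example when the PDF has no text layer), and the text is often padded with blank lines; treat both as assumptions to verify on your own files. A minimal clean-up sketch; `clean_tika_content` is a made-up helper name that only relies on the dict shape used above.

```python
def clean_tika_content(raw):
    """Return the stripped text from a Tika result dict, or "" if there is none."""
    content = raw.get("content") or ""
    # Strip whitespace around every line and drop the empty ones
    lines = [line.strip() for line in content.splitlines()]
    return "\n".join(line for line in lines if line)


# Works on plain dicts, so no Tika server is needed for this demo
print(clean_tika_content({"content": "\n\n  Hello  \n\nworld\n", "metadata": {}}))
print(repr(clean_tika_content({"content": None, "metadata": {}})))
```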

Benjamin Loison
DJK
  • 36
    I tested pypdf2, tika and tried and failed to install textract and pdftotext. Pypdf2 returned 99 words while tika returned all 858 words from my test invoice. So I ended up going with tika. – Stian Jun 19 '18 at 09:11
  • 31
    I keep getting a "RuntimeError: Unable to start Tika server" error. – Nav Oct 16 '18 at 12:39
  • 5
    If you need to run this on all the PDF files in a directory (recursively), take [this script](https://gist.github.com/nadya-p/373e1dc335293e490d89d00c895ea7b3) – Hope Apr 19 '19 at 10:28
  • 1
    This is very slow as it runs a Java REST web-server in localhost port 9998 under the hoods. – andruso Oct 03 '19 at 17:38
  • 5
    for who is having the "Unable to start Tika server" error, I solved installing the last version of Java as suggested [here](https://stackoverflow.com/a/53174932/4063051), which I did on Mac Os X with `brew` following [this answer](https://stackoverflow.com/a/28635465/4063051) – glS Oct 08 '19 at 14:51
  • As I am behind firewall, tika is no use to me, because it's contacting outside server – Lovro Oct 15 '19 at 11:46
  • 3
    It downloads a `tika-server.jar` 76 MB file into `C:\Users\User\AppData\Local\Temp`. Is there a way to make this permanent if I clean `temp` later? It also requires a JAVA vm installed, is that right? – Basj Nov 15 '19 at 12:30
  • 1
    "RuntimeError: Unable to start Tika server" error <-- Same error – No Holidays Sep 25 '20 at 00:33
  • 1
    I got this error when running your code - ```RuntimeError: Unable to start Tika server```. could you help me? – jis0324 Jan 19 '21 at 18:42
  • 1
    `RuntimeError: Unable to start Tika server` was solved after actually installing Java. I installed Java 8 Update 291 - 8.0.2910.10 and Java 8 Update 291 (64-bit) - 8.0.2910.10. – Rafs Jun 01 '21 at 19:57
  • 2
    @Stian PyPDF2 improved a lot. Could you please check again + update your comment? – Martin Thoma Jul 30 '22 at 21:57
  • http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar – CS QGB Jul 31 '22 at 20:51
184

pypdf recently improved a lot. Depending on the data, it is on-par or better than pdfminer.six.

pymupdf / tika / PDFium are better than pypdf, but the difference has become rather small (mostly in deciding when to start a new line). Their main advantage is that they are much faster. However, they are not pure Python, which can mean that you cannot run them in your environment, and some have licenses that may be too restrictive for your use.
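
If it is unclear which of these can be installed on the target machine, one option is to check at runtime and fall back to the pure-Python library. A rough sketch; the preference order is only an assumption, `fitz` is the import name of PyMuPDF as used further down, and `pick_backend` is a hypothetical helper.

```python
from importlib.util import find_spec


def pick_backend(candidates=("fitz", "pypdf")):
    """Return the first importable module name from `candidates`, else None."""
    for name in candidates:
        if find_spec(name) is not None:
            return name
    return None


# Prints the name of the first installed backend, or None if neither is there
print(pick_backend())
```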

Have a look at the benchmark. This benchmark mainly considers English texts, but also German ones. It does not include:

  • Anything special regarding tables (just that the text is there, not about the formatting)
  • Arabic text (RTL languages)
  • Mathematical formulas.

That means if your use-case requires those points, you might perceive the quality differently.

Having said that, the results from November 2022:

[Figure: benchmark results, quality]

[Figure: benchmark results, speed]

pypdf

I became the maintainer of pypdf and PyPDF2 in 2022! The community improved the text extraction a lot in 2022. Give it a try :-)

from pypdf import PdfReader

reader = PdfReader("example.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"
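
If the text is needed per page rather than as one big string, the same loop can collect it page by page. A sketch that only relies on each page having an `extract_text()` method, as in the snippet above; `extract_pages` and the `FakePage` stand-in are made up so that the example runs without a PDF file.

```python
def extract_pages(pages):
    """Return a list with the extracted text of each page ("" for empty pages)."""
    return [(page.extract_text() or "").strip() for page in pages]


class FakePage:
    """Stand-in for a page object, used only for this demo."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


texts = extract_pages([FakePage("First page\n"), FakePage("")])
for number, text in enumerate(texts, start=1):
    print(f"--- page {number} ({len(text)} characters) ---")
    print(text)
```

With a real file, `PdfReader("example.pdf").pages` would take the place of the fake pages.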

Please note that those packages are not maintained:

  • PyPDF2, PyPDF3, PyPDF4
  • pdfminer (without .six)

pymupdf

import fitz # install using: pip install PyMuPDF

with fitz.open("my.pdf") as doc:
    text = ""
    for page in doc:
        text += page.get_text()

print(text)

Other PDF libraries

  • pikepdf does not support text extraction (source)
Martin Thoma
  • 3
    Definitely the easiest way to read a PDF, thanks! – martin36 Mar 20 '21 at 17:05
  • However, there seems to be a problem with the order of the text from the PDF. Intuitively the text would read from top to bottom and left to right, but here it seems to show up in another order – martin36 Mar 20 '21 at 19:45
  • Except, it occasionally just can't find the text in a page... – Raf Sep 21 '21 at 08:26
  • 1
    @Raf If you have an example PDF, please go ahead and create an issue: https://github.com/pymupdf/PyMuPDF/issues - the developer behind it is pretty active – Martin Thoma Sep 21 '21 at 10:01
  • This is the most light-weight answer I've seen so far. No java server necessary! – Ryan Harris Nov 02 '21 at 14:59
  • 3
    This is the latest working solution as of 23 Jan 2022. – Hissaan Ali Jan 23 '22 at 13:20
  • AttributeError: function/symbol 'ARC4_stream_init' not found in library 'C:\QGB\Anaconda3\lib\site-packages\Crypto\Util\..\Cipher\_ARC4.cp37-win_amd64.pyd': error 0x7f – CS QGB Jul 31 '22 at 20:53
  • That might be a pycryptodome issue: https://github.com/py-pdf/PyPDF2/issues/1192 – Martin Thoma Aug 01 '22 at 04:08
85

Use textract.

It supports many types of files including PDFs

import textract
text = textract.process("path/to/file.extension")
Jakobovski
  • 1
    Works for PDFs, epubs, etc - processes PDFs that even PDFMiner fails on. – Ulad Kasach Feb 07 '17 at 01:57
  • How to use it in AWS Lambda? I tried this, but an import error occurred for textract – Arun Kumar Feb 27 '18 at 07:17
  • 8
    `textract` is a wrapper for `Poppler:pdftotext` (among others). – onewhaleid Apr 17 '18 at 00:21
  • 1
    @ArunKumar: To use anything in AWS Lambda that's not built-in, you have to include it and all extra dependencies, in your bundle. – Jeff Learman Jun 06 '18 at 15:58
  • @DavidBrown if you `conda install swig` before `pip install pocketsphinx` then `pip install textract` that seems to be the incantation that makes it work. – hobs Jan 14 '19 at 06:06
  • @DavidBrown: got error 'requests 2.21.0 has requirement chardet<3.1.0,>=3.0.2, but you'll have chardet 2.3.0 which is incompatible.' - tried to all versions of 'chardet'. Couldn't work in windows. – Vineesh TP Jul 14 '19 at 10:44
  • Not recommending the 'textract' library. It is very difficult to run and only works on Mac. Not working in Windows – Vineesh TP Aug 30 '19 at 13:37
  • textract is requiring me to downgrade to python 2.7 (from 3.7). No can do. – cduguet Dec 01 '19 at 08:21
  • it requires installation of extra packages in the system, but this library reads PDF like a magic. Note: it is better to add text.decode('utf-8') for non-ASCII documents – Timur Nurlygayanov Apr 22 '20 at 09:40
  • 4
    `textract` seems to be dead ([source](https://github.com/deanmalmgren/textract/issues/350)). Use either pdfminer.six directly or [pymupdf](https://stackoverflow.com/a/63518022/562769) – Martin Thoma Aug 21 '20 at 07:13
62

Look at this code for PyPDF2<=1.26.0:

import PyPDF2
pdf_file = open('sample.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
page = read_pdf.getPage(0)
page_content = page.extractText()
print(page_content)

The output is:

!"#$%#$%&%$&'()*%+,-%./01'*23%4
5'%1$#26%3/%7/))/8%&)/26%8#3"%3"*%313/9#&)
%

Using the same code to read a PDF from 201308FCR.pdf, the output is normal.

Its documentation explains why:

def extractText(self):
    """
    Locate all text drawing commands, in the order they are provided in the
    content stream, and extract the text.  This works well for some PDF
    files, but poorly for others, depending on the generator used.  This will
    be refined in the future.  Do not rely on the order of text coming out of
    this function, as it will change if this function is made more
    sophisticated.
    :return: a unicode string object.
    """
Martin Thoma
Quinn
46

After trying textract (which seemed to have too many dependencies) and pypdf2 (which could not extract text from the pdfs I tested with) and tika (which was too slow) I ended up using pdftotext from xpdf (as already suggested in another answer) and just called the binary from python directly (you may need to adapt the path to pdftotext):

import os, subprocess
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
args = ["/usr/local/bin/pdftotext",
        '-enc',
        'UTF-8',
        "{}/my-pdf.pdf".format(SCRIPT_DIR),
        '-']
res = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
output = res.stdout.decode('utf-8')

There is pdftotext, which does basically the same, but it assumes pdftotext is in /usr/local/bin, whereas I am using this in AWS Lambda and wanted to use it from the current directory.
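
To avoid hard-coding the path, the binary can be looked up on `PATH` first. A sketch under the assumption that a `pdftotext` binary is installed; `build_pdftotext_args` and `pdf_to_text` are made-up helper names, and `-layout` is an optional pdftotext flag that keeps the physical layout.

```python
import shutil
import subprocess


def build_pdftotext_args(pdf_path, binary=None, layout=False):
    """Build the pdftotext command line; '-' sends the output to stdout."""
    binary = binary or shutil.which("pdftotext")
    if binary is None:
        raise FileNotFoundError("pdftotext not found; pass its path explicitly")
    args = [binary, "-enc", "UTF-8"]
    if layout:
        args.append("-layout")
    return args + [pdf_path, "-"]


def pdf_to_text(pdf_path, **kwargs):
    """Run pdftotext and return the decoded text (raises if the tool fails)."""
    res = subprocess.run(build_pdftotext_args(pdf_path, **kwargs),
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                         check=True)
    return res.stdout.decode("utf-8")


# Only builds the command here; pdf_to_text() would actually run it
print(build_pdftotext_args("my-pdf.pdf", binary="./pdftotext", layout=True))
```

Passing `binary="./pdftotext"` covers the case of shipping the binary next to the script.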

By the way, to use this on Lambda you need to put the binary and its libstdc++.so dependency into your Lambda function. I personally needed to compile xpdf. As instructions for this would blow up this answer, I put them on my personal blog.

hansaplast
19

I've tried many Python PDF converters, and I'd like to update this review. Tika is one of the best, but PyMuPDF, suggested by user @ehsaneha, is good news too.

I wrote some code to compare them at https://github.com/erfelipe/PDFtextExtraction and hope it helps you.

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

from tika import parser

raw = parser.from_file("///Users/Documents/Textos/Texto1.pdf")
raw = str(raw)

safe_text = raw.encode('utf-8', errors='ignore')

safe_text = str(safe_text).replace("\n", "").replace("\\", "")
print('--- safe text ---' )
print( safe_text )
erfelipe
  • 5
    special thanks for `.encode('utf-8', errors='ignore')` – Evgeny Mar 24 '19 at 07:50
  • AttributeError: module 'os' has no attribute 'setsid' – keramat Feb 22 '20 at 06:50
  • this worked for me, when opening the file in 'rb' mode ```with open('../path/to/pdf','rb') as pdf: raw = str(parser.from_file(pdf)) text = raw.encode('utf-8', errors='ignore')``` – gl3yn Mar 31 '21 at 16:42
13

You may want to use the time-proven xPDF and derived tools to extract text instead, as PyPDF2 still seems to have various issues with text extraction.

The long answer is that there are a lot of variations in how text is encoded inside a PDF: you may need to decode the PDF string itself, then map it through a CMap, then analyze the distance between words and letters, etc.

In case the PDF is damaged (i.e. it displays the correct text but copying it gives garbage) and you really need to extract the text, you may want to consider converting the PDF into images (using ImageMagick) and then using Tesseract to get the text from the images via OCR.
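
To decide automatically whether a page needs the OCR route, a crude check is to look at how much of the extracted text consists of letters and digits. This is only a heuristic sketch; the 0.5 threshold and the `looks_garbled` name are assumptions, so tune them on your own documents.

```python
def looks_garbled(text, min_alnum_ratio=0.5):
    """Return True if too few of the non-space characters are letters/digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return True  # nothing extracted at all: also a candidate for OCR
    alnum = sum(c.isalnum() for c in chars)
    return alnum / len(chars) < min_alnum_ratio


# The broken output from the question vs. readable text
print(looks_garbled('! " # $ % # $ % &% $ &\' ( ) * % + , - % . / 0 1'))
print(looks_garbled("This is a sample PDF document"))
```

Text in scripts without letter casing or with many symbols (formulas, tables of punctuation) may need a different ratio.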

Eugene
10

PyPDF2 in some cases ignores the white spaces and makes the resulting text a mess, but I use PyMuPDF and I'm really satisfied. You can use this link for more info.

ehsaneha
  • pymupdf is the best solution I observed, does not require additional C++ libraries like pdftotext or java like tika – Kay Oct 04 '19 at 13:56
  • pymupdf is really the best solution, no additional server or libraries, and it works with files where PyPDF2, PyPDF3 and PyPDF4 retrieve an empty string of text. Many thanks! – Andrea Bisello Feb 26 '20 at 13:45
  • to install pymupdf, run `pip install pymupdf==1.16.16`. Using this specific version because today the newest version (17) is not working. I opted for pymupdf because it extracts text wrapping fields in new line char `\n`. So I'm extracting the text from pdf to a string with pymupdf and then I'm using `my_extracted_text.splitlines()` to get the text splitted in lines, into a list. – erickfis Apr 09 '20 at 13:53
  • PyMuPDF was really surprising. Thanks. – erfelipe May 04 '20 at 20:08
  • Page doesn't exist – Nouman Sep 22 '20 at 17:28
10

pdftotext is the best and simplest one! pdftotext also preserves the structure.

I tried PyPDF2, PDFMiner and a few others but none of them gave a satisfactory result.

Dharam
10

I found a solution here PDFLayoutTextStripper

It's good because it can keep the layout of the original PDF.

It's written in Java but I have added a Gateway to support Python.

Sample code:

from py4j.java_gateway import JavaGateway

gw = JavaGateway()
result = gw.entry_point.strip('samples/bus.pdf')

# result is a dict of {
#   'success': 'true' or 'false',
#   'payload': pdf file content if 'success' is 'true'
#   'error': error message if 'success' is 'false'
# }

print(result['payload'])

[Image: sample output from PDFLayoutTextStripper]

You can see more details here Stripper with Python

Tho
10

In 2020 the solutions above were not working for the particular pdf I was working with. Below is what did the trick. I am on Windows 10 and Python 3.8

Test pdf file: https://drive.google.com/file/d/1aUfQAlvq5hA9kz2c9CyJADiY3KpY3-Vn/view?usp=sharing

#pip install pdfminer.six
import io

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    '''Convert pdf content from a file path to text

    :path the file path
    '''
    rsrcmgr = PDFResourceManager()
    codec = 'utf-8'
    laparams = LAParams()

    with io.StringIO() as retstr:
        with TextConverter(rsrcmgr, retstr, codec=codec,
                           laparams=laparams) as device:
            with open(path, 'rb') as fp:
                interpreter = PDFPageInterpreter(rsrcmgr, device)
                password = ""
                maxpages = 0
                caching = True
                pagenos = set()

                for page in PDFPage.get_pages(fp,
                                              pagenos,
                                              maxpages=maxpages,
                                              password=password,
                                              caching=caching,
                                              check_extractable=True):
                    interpreter.process_page(page)

                return retstr.getvalue()


if __name__ == "__main__":
    print(convert_pdf_to_txt('C:\\Path\\To\\Test_PDF.pdf')) 
Jortega
  • Excellent answer. There's an anaconda install as well. I was installed and had extracted text in < 5 minutes. [note: tika also worked, but pdfminer.six was much faster) – CreekGeek Sep 21 '20 at 01:33
  • You are a lifesaver! – Sandeep Oct 21 '20 at 12:30
  • 1
    In 2023, 3 lines of `pypdf` do the same: [extract text with pypdf](https://pypdf.readthedocs.io/en/latest/user/extract-text.html) – Martin Thoma Mar 22 '23 at 09:25
9

The below code is a solution to the question in Python 3. Before running the code, make sure you have installed the pypdf library in your environment. If not installed, open the command prompt and run the following command (instead of pip you might need pip3):

pip install pypdf --upgrade

Solution Code using pypdf > 3.0.0:

import pypdf

reader = pypdf.PdfReader('sample.pdf')
for page in reader.pages:
    print(page.extract_text())
Martin Thoma
Steffi Keran Rani J
8

pdfplumber is one of the better libraries to read and extract data from pdf. It also provides ways to read table data and after struggling with a lot of such libraries, pdfplumber worked best for me.

Mind you, it works best for machine-written pdf and not scanned pdf.

import pdfplumber
with pdfplumber.open(r'D:\examplepdf.pdf') as pdf:
    first_page = pdf.pages[0]
    print(first_page.extract_text())
Dharman
Aklank Jain
7

I've got a better workaround than OCR that also maintains the page alignment while extracting the text from a PDF. It should be of help:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()


    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)


    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

text= convert_pdf_to_txt('test.pdf')
print(text)
Strayhorn
  • Nb. The latest version [no longer uses the `codec` arg](https://stackoverflow.com/a/59497669/1461850) . I fixed this by removing it i.e. `device = TextConverter(rsrcmgr, retstr, laparams=laparams)` – Lee Jul 10 '20 at 12:56
6

A multi-page PDF can be extracted as text in a single pass, instead of giving an individual page number as argument, using the code below:

import PyPDF2

with open('samples.pdf', 'rb') as pdf_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    # Walk through every page index and print that page's text
    for i in range(number_of_pages):
        page = read_pdf.getPage(i)
        page_content = page.extractText()
        print(page_content)
Yogi
5

You can use PDFtoText https://github.com/jalan/pdftotext

PDF to text keeps text format indentation, doesn't matter if you have tables.

Máxima Alekz
5

As of 2021 I would like to recommend pdfreader, because PyPDF2/3 seems to be troublesome now and tika is actually written in Java and needs a JRE in the background. pdfreader is pythonic, currently well maintained and has extensive documentation.

Installation as usual: pip install pdfreader

Short example of usage:

from pdfreader import PDFDocument, SimplePDFViewer

# get raw document (file_name is a placeholder; point it at your own PDF)
file_name = "example.pdf"
fd = open(file_name, "rb")
doc = PDFDocument(fd)

# there is an iterator for pages
page_one = next(doc.pages())
all_pages = [p for p in doc.pages()]

# and even a viewer
fd = open(file_name, "rb")
viewer = SimplePDFViewer(fd)
harmonica141
  • On a note, installing `pdfreader` on Windows requires Microsoft C++ Build Tools installed on your system, whilst the answer below recommending `pymupdf` installed directly using `pip` without any extra requirement. – Raf Sep 21 '21 at 06:14
  • I couldnt use it on jupyter notebook, keeps crashing the kernel – West Mar 06 '22 at 20:31
4

If wanting to extract text from a table, I've found tabula to be easily implemented, accurate, and fast:

to get a pandas dataframe:

import tabula

df = tabula.read_pdf('your.pdf')

df

By default, it ignores page content outside of the table. So far, I've only tested on a single-page, single-table file, but there are kwargs to accommodate multiple pages and/or multiple tables.

install via:

pip install tabula-py
# or
conda install -c conda-forge tabula-py 

In terms of straight-up text extraction see: https://stackoverflow.com/a/63190886/9249533

CreekGeek
  • `tabula` is impressive. Of all the solutions I tested from this page, this is the only one that was able to maintain the order of rows and fields. There are still a few adjustments needed for complex tables, but since the output seems reproductible from one table to the other and is stored in a `pandas.DataFrame` it is easy to correct. – Skippy le Grand Gourou Feb 01 '21 at 16:15
  • Also check Camelot. – Skippy le Grand Gourou Feb 01 '21 at 17:25
3

Here is the simplest code for extracting text

code:

# importing required modules
import PyPDF2

# creating a pdf file object
pdfFileObj = open('filename.pdf', 'rb')

# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# printing number of pages in pdf file
print(pdfReader.numPages)

# creating a page object
pageObj = pdfReader.getPage(5)

# extracting text from page
print(pageObj.extractText())

# closing the pdf file object
pdfFileObj.close()
Tasneem Haider
Infinity
2

Use pdfminer.six. Here is the doc: https://pdfminersix.readthedocs.io/en/latest/index.html

To convert a PDF to text:

    def pdf_to_text():
        from pdfminer.high_level import extract_text

        text = extract_text('test.pdf')
        print(text)
alpha
2

You can simply do this using pytesseract and pdf2image. Refer to the following code. You can get more details from this article.

import os
from pdf2image import convert_from_path
import pytesseract

filePath = '021-DO-YOU-WONDER-ABOUT-RAIN-SNOW-SLEET-AND-HAIL-Free-Childrens-Book-By-Monkey-Pen.pdf'
# Render every PDF page to a PIL image (pdf2image requires poppler)
doc = convert_from_path(filePath)

path, fileName = os.path.split(filePath)
fileBaseName, fileExtension = os.path.splitext(fileName)

# Run OCR on each page image and print the recognized text
for page_number, page_data in enumerate(doc):
    txt = pytesseract.image_to_string(page_data)
    print("Page # {} - {}".format(page_number, txt))
0

I am adding code to accomplish this: It is working fine for me:

# This works in python 3
# required python packages
# tabula-py==1.0.0
# PyPDF2==1.26.0
# Pillow==4.0.0
# pdfminer.six==20170720

import os
import shutil
import struct
import warnings
from io import StringIO

import requests
import tabula
from PIL import Image
from PyPDF2 import PdfFileWriter, PdfFileReader
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

warnings.filterwarnings("ignore")


def download_file(url):
    local_filename = url.split('/')[-1]
    local_filename = local_filename.replace("%20", "_")
    r = requests.get(url, stream=True)
    print(r)
    with open(local_filename, 'wb') as f:
        shutil.copyfileobj(r.raw, f)

    return local_filename


class PDFExtractor():
    def __init__(self, url):
        self.url = url

    # Splitting the PDF into one file per page
    def break_pdf(self, filename, start_page=-1, end_page=-1):
        pdf_reader = PdfFileReader(open(filename, "rb"))
        # Reading each pdf one by one
        total_pages = pdf_reader.numPages
        if start_page == -1:
            start_page = 0
        elif start_page < 1 or start_page > total_pages:
            return "Start Page Selection Is Wrong"
        else:
            start_page = start_page - 1

        if end_page == -1:
            end_page = total_pages
        elif end_page < 1 or end_page > total_pages - 1:
            return "End Page Selection Is Wrong"
        else:
            end_page = end_page

        for i in range(start_page, end_page):
            output = PdfFileWriter()
            output.addPage(pdf_reader.getPage(i))
            with open(str(i + 1) + "_" + filename, "wb") as outputStream:
                output.write(outputStream)

    def extract_text_algo_1(self, file):
        pdf_reader = PdfFileReader(open(file, 'rb'))
        # creating a page object
        pageObj = pdf_reader.getPage(0)

        # extracting extract_text from page
        text = pageObj.extractText()
        text = text.replace("\n", "").replace("\t", "")
        return text

    def extract_text_algo_2(self, file):
        pdfResourceManager = PDFResourceManager()
        retstr = StringIO()
        la_params = LAParams()
        device = TextConverter(pdfResourceManager, retstr, codec='utf-8', laparams=la_params)
        fp = open(file, 'rb')
        interpreter = PDFPageInterpreter(pdfResourceManager, device)
        password = ""
        max_pages = 0
        caching = True
        page_num = set()

        for page in PDFPage.get_pages(fp, page_num, maxpages=max_pages, password=password, caching=caching,
                                      check_extractable=True):
            interpreter.process_page(page)

        text = retstr.getvalue()
        text = text.replace("\t", "").replace("\n", "")

        fp.close()
        device.close()
        retstr.close()
        return text

    def extract_text(self, file):
        text1 = self.extract_text_algo_1(file)
        text2 = self.extract_text_algo_2(file)

        if len(text2) > len(str(text1)):
            return text2
        else:
            return text1

    def extarct_table(self, file):

        # Read pdf into DataFrame
        try:
            df = tabula.read_pdf(file, output_format="csv")
        except:
            print("Error Reading Table")
            return

        print("\nPrinting Table Content: \n", df)
        print("\nDone Printing Table Content\n")

    def tiff_header_for_CCITT(self, width, height, img_size, CCITT_group=4):
        tiff_header_struct = '<' + '2s' + 'h' + 'l' + 'h' + 'hhll' * 8 + 'h'
        return struct.pack(tiff_header_struct,
                           b'II',  # Byte order indication: little-endian
                           42,  # Version number (always 42)
                           8,  # Offset to first IFD
                           8,  # Number of tags in IFD
                           256, 4, 1, width,  # ImageWidth, LONG, 1, width
                           257, 4, 1, height,  # ImageLength, LONG, 1, length
                           258, 3, 1, 1,  # BitsPerSample, SHORT, 1, 1
                           259, 3, 1, CCITT_group,  # Compression, SHORT, 1, 4 = CCITT Group 4 fax encoding
                           262, 3, 1, 0,  # PhotometricInterpretation, SHORT, 1, 0 = WhiteIsZero
                           273, 4, 1, struct.calcsize(tiff_header_struct),  # StripOffsets, LONG, 1, length of header
                           278, 4, 1, height,  # RowsPerStrip, LONG, 1, length
                           279, 4, 1, img_size,  # StripByteCounts, LONG, 1, size of image
                           0  # last IFD
                           )

    def extract_image(self, filename):
        number = 1
        pdf_reader = PdfFileReader(open(filename, 'rb'))

        for i in range(0, pdf_reader.numPages):

            page = pdf_reader.getPage(i)

            try:
                xObject = page['/Resources']['/XObject'].getObject()
            except:
                print("No XObject Found")
                return

            for obj in xObject:

                try:

                    if xObject[obj]['/Subtype'] == '/Image':
                        size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
                        data = xObject[obj]._data
                        if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                            mode = "RGB"
                        else:
                            mode = "P"

                        image_name = filename.split(".")[0] + str(number)

                        print(xObject[obj]['/Filter'])

                        if xObject[obj]['/Filter'] == '/FlateDecode':
                            data = xObject[obj].getData()
                            img = Image.frombytes(mode, size, data)
                            img.save(image_name + "_Flate.png")
                            # save_to_s3(imagename + "_Flate.png")
                            print("Image_Saved")

                            number += 1
                        elif xObject[obj]['/Filter'] == '/DCTDecode':
                            img = open(image_name + "_DCT.jpg", "wb")
                            img.write(data)
                            # save_to_s3(imagename + "_DCT.jpg")
                            img.close()
                            number += 1
                        elif xObject[obj]['/Filter'] == '/JPXDecode':
                            img = open(image_name + "_JPX.jp2", "wb")
                            img.write(data)
                            # save_to_s3(imagename + "_JPX.jp2")
                            img.close()
                            number += 1
                        elif xObject[obj]['/Filter'] == '/CCITTFaxDecode':
                            if xObject[obj]['/DecodeParms']['/K'] == -1:
                                CCITT_group = 4
                            else:
                                CCITT_group = 3
                            width = xObject[obj]['/Width']
                            height = xObject[obj]['/Height']
                            data = xObject[obj]._data  # sorry, getData() does not work for CCITTFaxDecode
                            img_size = len(data)
                            tiff_header = self.tiff_header_for_CCITT(width, height, img_size, CCITT_group)
                            img_name = image_name + '_CCITT.tiff'
                            with open(img_name, 'wb') as img_file:
                                img_file.write(tiff_header + data)

                            # save_to_s3(img_name)
                            number += 1
                except Exception:
                    # skip anything that fails to decode, but let KeyboardInterrupt/SystemExit through
                    continue

        return number

    def read_pages(self, start_page=-1, end_page=-1):

        # Downloading file locally
        downloaded_file = download_file(self.url)
        print(downloaded_file)

        # breaking PDF into number of pages in diff pdf files
        self.break_pdf(downloaded_file, start_page, end_page)

        # creating a pdf reader object
        pdf_reader = PdfFileReader(open(downloaded_file, 'rb'))

        # Reading each pdf one by one
        total_pages = pdf_reader.numPages

        if start_page == -1:
            start_page = 0
        elif start_page < 1 or start_page > total_pages:
            return "Start Page Selection Is Wrong"
        else:
            start_page = start_page - 1

        if end_page == -1:
            end_page = total_pages
        elif end_page < 1 or end_page > total_pages - 1:
            return "End Page Selection Is Wrong"
        else:
            end_page = end_page

        for i in range(start_page, end_page):
            # creating a page based filename
            file = str(i + 1) + "_" + downloaded_file

            print("\nStarting to Read Page: ", i + 1, "\n -----------===-------------")

            file_text = self.extract_text(file)
            print(file_text)
            self.extract_image(file)

            self.extarct_table(file)
            os.remove(file)
            print("Stopped Reading Page: ", i + 1, "\n -----------===-------------")

        os.remove(downloaded_file)


# I have tested on these 3 pdf files
# url = "http://s3.amazonaws.com/NLP_Project/Original_Documents/Healthcare-January-2017.pdf"
url = "http://s3.amazonaws.com/NLP_Project/Original_Documents/Sample_Test.pdf"
# url = "http://s3.amazonaws.com/NLP_Project/Original_Documents/Sazerac_FS_2017_06_30%20Annual.pdf"
# creating the instance of class
pdf_extractor = PDFExtractor(url)

# Getting desired data out
pdf_extractor.read_pages(15, 23)
Girish Gupta
  • 1,241
  • 13
  • 27
0

You can download tika-app-xxx.jar (latest) from here.

Then put this .jar file in the same folder as your Python script file.

Then insert the following code in the script:

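import os.path
import subprocess

# path to the Tika jar, expected next to this script ('<tika-app-xxx>.jar' is a placeholder)
tika_dir = os.path.join(os.path.dirname(__file__), '<tika-app-xxx>.jar')

def extract_pdf(source_pdf: str, target_txt: str):
    # run "java -jar <jar> -t <source>" and write its stdout to target_txt;
    # passing a list avoids shell quoting problems with spaces in paths
    with open(target_txt, 'wb') as out:
        subprocess.run(['java', '-jar', tika_dir, '-t', source_pdf], stdout=out, check=True)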

The advantages of this method:

Fewer dependencies. A single .jar file is easier to manage than a Python package.

Multi-format support. The source_pdf argument can be the path of any kind of document (.doc, .html, .odt, etc.).

Up to date. tika-app.jar is always released earlier than the corresponding version of the tika Python package.

Stable. It is far more stable and better maintained (powered by Apache) than PyPDF.

Disadvantage:

A headless JRE is required.
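
Because this approach shells out to Java, it can help to check for a Java runtime before calling the jar. A minimal sketch, assuming the java executable is looked up on PATH; the helper name java_available is hypothetical and not part of the answer above:

```python
import shutil


def java_available(executable: str = "java") -> bool:
    """Return True if the given executable can be found on PATH."""
    # shutil.which returns the full path of the executable, or None if it is absent
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if not java_available():
        print("No Java runtime found on PATH; install a (headless) JRE first.")
```

Calling this once at startup gives a clearer error than a failed java invocation halfway through a batch of documents.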

pah8J
  • 807
  • 9
  • 15
  • totally not pythonic solution. If you recommend this, you should build a python package and have people import that. Don't recommend using command line executions of java code in python. – Michael Tamillow Dec 11 '18 at 04:30
  • @MichaelTamillow, if writing a code which is going to be uploaded into pypi, I admit that it is not a good idea. However, if it is just a python script with shebang for temporary usage, it is not bad, doesn't it? – pah8J Jan 15 '19 at 08:06
  • Well, the question isn't titled with "python" - so I think stating "here's how to do it in Java" is more acceptable than this. Technically, you can do whatever you want in Python. That's why it is both awesome and terrible. Temporary usage is a bad habit. – Michael Tamillow Jan 21 '19 at 19:27
0

If you try it in Anaconda on Windows, PyPDF2 might not handle some PDFs with a non-standard structure or Unicode characters. I recommend the following code if you need to open and read a lot of PDF files: the text of all PDF files in the folder with relative path .\pdfs\ will be stored in the list pdf_text_list.

from tika import parser
import glob

def read_pdf(filename):
    # parser.from_file returns a dict; the extracted text is under 'content'
    text = parser.from_file(filename)
    return text


all_files = glob.glob(".\\pdfs\\*.pdf")
pdf_text_list = []
for file in all_files:
    text = read_pdf(file)
    pdf_text_list.append(text['content'])

print(pdf_text_list)
Shayki Abramczyk
  • 36,824
  • 16
  • 89
  • 114
DovaX
  • 958
  • 11
  • 16
0

To extract text from a PDF, use the code below:

import PyPDF2
pdfFileObj = open('mypdf.pdf', 'rb')

pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

print(pdfReader.numPages)

pageObj = pdfReader.getPage(0)

a = pageObj.extractText()

print(a)
Elavarasan r
  • 1,055
  • 2
  • 12
  • 22
  • 1
    [PyPDF2](https://github.com/mstamy2/PyPDF2/issues/571) / PyPDF3 / PyPDF4 are all dead. Use [pymupdf](https://stackoverflow.com/a/63518022/562769) – Martin Thoma Aug 21 '20 at 07:16
0

A more robust way, whether there are multiple PDFs or just one:

import os
from PyPDF2 import PdfFileReader

# placeholder: specify the path to the directory that holds the PDF or PDFs
mydir = "<path-to-pdf-directory>"

# open the output file once so that every PDF's text is kept
with open("myfile.txt", "w") as out_file:
    for arch in os.listdir(mydir):
        if not arch.lower().endswith(".pdf"):
            continue  # skip anything that is not a PDF
        archpath = os.path.join(mydir, arch)
        with open(archpath, "rb") as pdf_file:
            pdf_reader = PdfFileReader(pdf_file)
            # extract the text of the first page only
            ley = pdf_reader.getPage(0).extractText()
            out_file.write(ley)

Andres Ordorica
  • 302
  • 1
  • 5
0

Try out borb, a pure python PDF library

import typing  
from borb.pdf.document import Document  
from borb.pdf.pdf import PDF  
from borb.toolkit.text.simple_text_extraction import SimpleTextExtraction  


def main():

    # variable to hold Document instance
    doc: typing.Optional[Document] = None  

    # this implementation of EventListener handles text-rendering instructions
    l: SimpleTextExtraction = SimpleTextExtraction()  

    # open the document, passing along the array of listeners
    with open("input.pdf", "rb") as in_file_handle:  
        doc = PDF.loads(in_file_handle, [l])  
  
    # were we able to read the document?
    assert doc is not None  

    # print the text on page 0
    print(l.get_text(0))  

if __name__ == "__main__":
    main()

Joris Schellekens
  • 8,483
  • 2
  • 23
  • 54
  • How do you get the total number of pages of the document with borb? (or how do you get the complete text directly?) – Martin Thoma May 14 '22 at 15:49
-1

PyPDF2 does work, but results may vary. I am seeing quite inconsistent results from its text extraction.

reader=PyPDF2.pdf.PdfFileReader(self._path)
eachPageText=[]
for i in range(0,reader.getNumPages()):
    pageText=reader.getPage(i).extractText()
    print(pageText)
    eachPageText.append(pageText)
bmc
  • 817
  • 1
  • 12
  • 23
  • 1
    [PyPDF2](https://github.com/mstamy2/PyPDF2/issues/571) / PyPDF3 / PyPDF4 are all dead. Use [pymupdf](https://stackoverflow.com/a/63518022/562769) – Martin Thoma Aug 21 '20 at 07:18
-1

Camelot seems a fairly powerful solution to extract tables from PDFs in Python.

At first sight it seems to achieve almost as accurate extraction as the tabula-py package suggested by CreekGeek, which is already waaaaay above any other posted solution as of today in terms of reliability, but it is supposedly much more configurable. Furthermore it has its own accuracy indicator (results.parsing_report), and great debugging features.

Both Camelot and Tabula provide the results as Pandas’ DataFrames, so it is easy to adjust tables afterwards.

pip install camelot-py

(Not to be confused with the camelot package.)

import camelot

df_list = []
results = camelot.read_pdf("file.pdf", ...)
for table in results:
    print(table.parsing_report)
    df_list.append(table.df)

It can also output results as CSV, JSON, HTML or Excel.

Camelot comes at the expense of a number of dependencies.

NB : Since my input is pretty complex with many different tables I ended up using both Camelot and Tabula, depending on the table, to achieve the best results.
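
The accuracy indicator mentioned above can be used to decide which tables to keep. A minimal sketch, assuming each parsing report is a dict with an 'accuracy' entry expressed as a percentage (that is my understanding of table.parsing_report; check it against your Camelot version). The helper filter_by_accuracy and the threshold of 90 are illustrative, not part of Camelot:

```python
def filter_by_accuracy(tables, min_accuracy=90.0):
    """Keep only the tables whose parsing report meets the accuracy threshold.

    `tables` is any iterable of objects exposing a `parsing_report` dict,
    such as the result of camelot.read_pdf (assumed, see the note above).
    """
    kept = []
    for table in tables:
        # a missing 'accuracy' key is treated as 0, so that table is dropped
        if table.parsing_report.get("accuracy", 0.0) >= min_accuracy:
            kept.append(table)
    return kept
```

With the snippet above, this would be used as df_list = [t.df for t in filter_by_accuracy(results)]; tables that fall below the threshold are candidates for a retry with other settings or with Tabula.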

Skippy le Grand Gourou
  • 6,976
  • 4
  • 60
  • 76
-1

The following script writes the extracted text to an Excel workbook, creating a new sheet for each PDF page; the number of sheets is set dynamically from the number of pages in the document.

import PyPDF2 as p2
import xlsxwriter

pdfFileName = "sample.pdf"
pdfFile = open(pdfFileName, 'rb')
pdfread = p2.PdfFileReader(pdfFile)
number_of_pages = pdfread.getNumPages()
workbook = xlsxwriter.Workbook('pdftoexcel.xlsx')

for page_number in range(number_of_pages):
    print(f'Sheet{page_number}')
    pageinfo = pdfread.getPage(page_number)
    rawInfo = pageinfo.extractText().split('\n')

    row = 0
    column = 0
    worksheet = workbook.add_worksheet(f'Sheet{page_number}')

    for line in rawInfo:
        worksheet.write(row, column, line)
        row += 1
workbook.close()
Daniel Danielecki
  • 8,508
  • 6
  • 68
  • 94
-1

Objectives: Extract text from PDF

Required Tools:

  1. Poppler for Windows (the library that pdftotext wraps). With Anaconda: conda install -c conda-forge

  2. pdftotext utility to convert PDF to text.

Steps: Install Poppler. On Windows, add “xxx/bin/” to the PATH environment variable. Then run pip install pdftotext.

import pdftotext
 
# Load your PDF
with open("Target.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)
 
# Save all text to a txt file.
with open('output.txt', 'w') as f:
    f.write("\n\n".join(pdf))
Shaina Raza
  • 1,474
  • 17
  • 12
-1

Go through the official documentation; the following example is given there:

from PyPDF2 import PdfReader

reader = PdfReader("example.pdf")
page = reader.pages[0]
print(page.extract_text())
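
The example above reads only the first page. To collect the whole document, the same two calls (reader.pages and page.extract_text()) can be looped over. A minimal sketch; the helper name extract_all_text and the separator argument are my own additions, not part of PyPDF2:

```python
def extract_all_text(reader, separator="\n"):
    """Join the text of every page of a PdfReader-like object."""
    # fall back to "" in case a page yields no text (e.g. a scanned image)
    return separator.join(page.extract_text() or "" for page in reader.pages)
```

It would be called as extract_all_text(PdfReader("example.pdf")). Pages without a text layer contribute an empty string, so the number of separators still matches the page count.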
Mounesh
  • 561
  • 5
  • 18
-1

I will introduce another library that hasn't been mentioned yet, providing you with additional options. Extracting text from PDFs can also be achieved using IronPdf.

The IronPDF library can be added via pip. Use the command below to install IronPDF using pip:

pip install ironpdf

IronPDF Python relies on .NET 6.0, as its underlying technology. Therefore, it is necessary to have the .NET 6.0 SDK installed on your machine in order to use IronPDF Python.

from ironpdf import *
 
# Load existing PDF document
pdf = PdfDocument.FromFile("content.pdf")
 
# Extract text from PDF document
all_text = pdf.ExtractAllText()
 
# Extract text from specific page in the document
page_2_text = pdf.ExtractTextFromPage(1)

In the provided code snippet, the PDF document is imported, and a method is employed to extract text from the imported PDF document. This approach enables efficient text extraction from PDF files.


-8

How to extract text from a PDF file?

The first thing to understand is the PDF format. It has a public specification written in English: see ISO 32000-2:2017, and read the more than 700 pages of the PDF 1.7 specification. At the very least, you need to read the Wikipedia page about PDF.

Once you have understood the details of the PDF format, extracting text is more or less easy (but what about text appearing in figures or images; its figure 1)? Don't expect to write a perfect text extractor alone in a few weeks...

On Linux, you might also use pdftotext (from poppler-utils), which you could popen from your Python code.
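
The command-line route can be sketched with subprocess. This assumes the Poppler pdftotext utility is the tool meant and that it is on PATH; the function names are hypothetical. The "-" output argument (write to standard output) and the -layout option are pdftotext's own:

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, layout=False):
    """Build the argument list for pdftotext, writing the text to stdout."""
    command = ["pdftotext"]
    if layout:
        command.append("-layout")  # try to keep the physical layout of the page
    # "-" as the output file tells pdftotext to write to standard output
    command += [pdf_path, "-"]
    return command


def pdf_to_text(pdf_path, layout=False):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH (install poppler-utils)")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.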

In general, extracting text from a PDF file is an ill-defined problem. For a human reader, some text could be made up (as a figure) of separate dots, or appear in a photo, etc.

The Google search engine is capable of extracting text from PDF, but is rumored to need more than half a billion lines of source code. Do you have the necessary resources (in man power, in budget) to develop a competitor?

A possibility might be to print the PDF to some virtual printer (e.g. using GhostScript or Firefox), then to use OCR techniques to extract text.

I would recommend instead to work on the data representation which has generated that PDF file, for example on the original LaTeX code (or Lout code) or on OOXML code.

In all cases, you need to budget at least several person years of software development.

Basile Starynkevitch
  • 223,805
  • 18
  • 296
  • 547
  • 3
    This is not an answer. It says read this 700-page document and doesn't give an approach for actually addressing the question. – v2v1 Oct 17 '20 at 00:28