How to extract text from pdf in Python 3.7

Question

I am trying to extract text from a PDF file using Python. My main goal is a program that reads a bank statement and extracts its text to update an Excel file, so that I can easily record monthly spending. Right now I am focusing just on extracting the text from the PDF file, but I don't know how to do so.

What is currently the best and easiest way to extract text from a PDF file into a string? What library is best to use today and how can I do it?

I have tried using PyPDF2, but every time I try to extract text from any page using extractText(), it returns empty strings. I have also tried installing textract, but I get errors, I think because it needs more libraries.

from PyPDF2 import PdfReader

reader = PdfReader("January2019.pdf")
page = reader.pages[0]
print(page.extract_text())

This prints empty strings when it should be printing the contents of the page.

Edit: This question was asked for a very old PyPDF2 version. Newer versions of PyPDF2 have improved text extraction a lot.

How about searching through the questions already on SO? https://stackoverflow.com/questions/tagged/pypdf2 — lit, Apr 19 '19 at 20:35
Yes there is actual text all over the pdf that I can highlight. — RaV1oLLi, Apr 19 '19 at 20:40
@SyntaxVoidsupportsMonica PyPDF2 improved text extraction a lot. It's now pretty good. Please give it a shot :-) — Martin Thoma, Jul 30 '22 at 21:59
Also, the quote you gave from the docs is no longer applicable (I'm the maintainer of PyPDF2) — Martin Thoma, Jul 30 '22 at 21:59

score 51 · Answer 1 · answered Dec 18 '19 at 01:51

I tried many methods that failed, including PyPDF2 and Tika. I finally found the pdfplumber module, which works for me; you can try it too.

Hope this is helpful to you.

import pdfplumber

# The with-block closes the PDF automatically
with pdfplumber.open('pdffile.pdf') as pdf:
    page = pdf.pages[0]  # first page
    text = page.extract_text()
    print(text)

Fly your ideas

Could you loop this solution for multiple folders with multiple pdfs and transform the results in dataframe or alike? I have a question about it if you could kindly look -> https://stackoverflow.com/questions/66224627/how-to-extract-text-from-pdfs-in-folders-with-python-and-save-them-in-dataframe – AHK Feb 16 '21 at 12:51
excellent package, much better than PyPDF2, thank you! – Aska May 23 '22 at 16:28
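
For the multi-folder case raised in the comment above, one possible sketch walks a directory tree and collects one record per PDF. The names collect_pdf_text and extract_with_pdfplumber, the extractor parameter, and the record keys are hypothetical; the pdfplumber calls are the same ones used in the answer.

```python
from pathlib import Path
from typing import Callable, Dict, List


def extract_with_pdfplumber(pdf_path: Path) -> str:
    """Join the text of every page; pdfplumber is imported lazily."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can return None for pages without a text layer
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def collect_pdf_text(
    root: str,
    extractor: Callable[[Path], str] = extract_with_pdfplumber,
) -> List[Dict[str, str]]:
    """Return one {"folder", "file", "text"} record per PDF found under root."""
    records = []
    for pdf_path in sorted(Path(root).rglob("*.pdf")):
        records.append({
            "folder": pdf_path.parent.name,
            "file": pdf_path.name,
            "text": extractor(pdf_path),
        })
    return records
```

If a dataframe is wanted, the returned list of dictionaries can be passed to pandas.DataFrame(records).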

score 19 · Accepted Answer · answered Apr 19 '19 at 20:56

Using tika worked for me!

from tika import parser

rawText = parser.from_file('January2019.pdf')

rawList = rawText['content'].splitlines()

This made it really easy to separate each line of the bank statement into a list.
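
Once the statement is a list of lines, turning it into spreadsheet rows is mostly pattern matching. The sketch below assumes, purely for illustration, that each transaction line looks like "01/15/2019 COFFEE SHOP 4.50" (date, description, amount). Real statements differ by bank, so the regular expression and the names parse_transactions and to_csv are hypothetical and would need adapting.

```python
import csv
import io
import re

# Hypothetical line layout: MM/DD/YYYY, free-text description, amount
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4})\s+(.+?)\s+(-?[\d,]+\.\d{2})$")


def parse_transactions(lines):
    """Return (date, description, amount) tuples for lines that match."""
    rows = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            date, description, amount = match.groups()
            # Drop thousands separators before converting to a number
            rows.append((date, description, float(amount.replace(",", ""))))
    return rows


def to_csv(rows):
    """Serialize rows to CSV text, which Excel can open directly."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["date", "description", "amount"])
    writer.writerows(rows)
    return buffer.getvalue()
```

With the tika example above, parse_transactions(rawList) would return only the lines that fit the assumed layout and silently skip headers and blank lines.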

RaV1oLLi

Finally found a solution that worked for me. All of these other PDF scanners did not work for my use case, and that may be due to the formatting of the actual PDF. However, this tika package worked flawlessly. You will need to install the latest version of Java, as well as the Java tika server.jar file. Once you download the Java tika server jar file, you can run java -jar java-tika-server.jar from cmd on Windows to start the local server; then this package will work for Python – dataviews May 27 '19 at 17:09
It is the best thing I found. I have tried `PyPDF2` and `pdfminer`, but this suits my purpose because it gives line-by-line output. – Siddharth Das Jun 20 '19 at 08:00
I can confirm that tika is a very nice choice. I like it for its simplicity and its ability to extract links from a PDF. However, I found an even better way to do the job from the Windows command line: "gswin64c -sDEVICE=txtwrite -o pdf2text.txt "sample.pdf"" ...provided you have gswin64c.exe installed and the path set correctly. It was installed on my machine; I just had to set the PATH. – Andrew Anderson Oct 01 '20 at 10:41

score 9 · Answer 3 · answered Aug 19 '20 at 12:30

If you are looking for a maintained, bigger project, have a look at PyMuPDF. Install it with pip install pymupdf and use it like this:

import fitz

def get_text(filepath: str) -> str:
    with fitz.open(filepath) as doc:
        text = ""
        for page in doc:
            text += page.get_text().strip()  # getText() was renamed to get_text()
        return text

Martin Thoma

you saved me from losing my sanity. I'm trying to open pdfs with arabic, Chinese, non English language and your solution preserved the characters, thank you – user1465073 Jan 12 '21 at 13:34
This solution seems more effective than PyPDF2. – arjun Mar 17 '22 at 13:42

score 3 · Answer 4 · answered May 14 '20 at 18:31

PyPDF2 is highly unreliable for extracting text from PDFs, as pointed out here too. It says:

While PyPDF2 has .extractText(), which can be used on its page objects (not shown in this example), it does not work very well. Some PDFs will return text and some will return an empty string. When you want to extract text from a PDF, you should check out the PDFMiner project instead. PDFMiner is much more robust and was specifically designed for extracting text from PDFs.

You could instead install and use pdfminer (pdfminer.six is the maintained Python 3 fork):

pip install pdfminer.six

Or you can use another open-source utility named pdftotext, by xpdfreader. Instructions for using the utility are given on the page.

You can download the command-line tools from here and use the pdftotext.exe utility via subprocess. A detailed explanation of using subprocess is given here.
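
A minimal sketch of the subprocess approach, assuming a pdftotext executable is installed and reachable on the PATH. The function names build_pdftotext_command and run_pdftotext are hypothetical; the -layout and -enc options and the "-" output argument (write to stdout) are standard pdftotext options.

```python
import subprocess


def build_pdftotext_command(pdf_path, executable="pdftotext", keep_layout=True):
    """Build the argument list; "-" asks pdftotext to write to stdout."""
    command = [executable]
    if keep_layout:
        command.append("-layout")  # keep the physical layout of the page
    command += ["-enc", "UTF-8", pdf_path, "-"]
    return command


def run_pdftotext(pdf_path, executable="pdftotext"):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, executable),
        capture_output=True,  # requires Python 3.7+
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.decode("utf-8", errors="replace")
```

On Windows, executable could be set to the full path of pdftotext.exe if the tool is not on the PATH.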

score 1 · Answer 5 · answered Apr 19 '19 at 20:44

PyPDF2 does not read the whole PDF correctly. You can use this code instead.

    import pdftotext

    # Open in binary mode; the with-block closes the file afterwards
    with open("January2019.pdf", 'rb') as pdf_file:
        pdf = pdftotext.PDF(pdf_file)

    # Iterate over all the pages
    for page in pdf:
        print(page)

Şafak Çıplak

score 1 · Answer 6 · answered Jul 31 '20 at 11:21

Here is an alternative solution in Windows 10, Python 3.8

Example test pdf: https://drive.google.com/file/d/1aUfQAlvq5hA9kz2c9CyJADiY3KpY3-Vn/view?usp=sharing

#pip install pdfminer.six
import io

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    '''Convert pdf content from a file path to text

    :path the file path
    '''
    rsrcmgr = PDFResourceManager()
    codec = 'utf-8'
    laparams = LAParams()

    with io.StringIO() as retstr:
        with TextConverter(rsrcmgr, retstr, codec=codec,
                           laparams=laparams) as device:
            with open(path, 'rb') as fp:
                interpreter = PDFPageInterpreter(rsrcmgr, device)
                password = ""
                maxpages = 0
                caching = True
                pagenos = set()

                for page in PDFPage.get_pages(fp,
                                              pagenos,
                                              maxpages=maxpages,
                                              password=password,
                                              caching=caching,
                                              check_extractable=True):
                    interpreter.process_page(page)

                return retstr.getvalue()


if __name__ == "__main__":
    print(convert_pdf_to_txt('C:\\Path\\To\\Test_PDF.pdf'))
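
For simple cases, pdfminer.six releases that include the pdfminer.high_level module offer a shorter route through its extract_text helper. The sketch below is an alternative under that assumption, not a replacement for the code above; the wrapper name pdf_to_text and the explicit path check are hypothetical additions.

```python
from pathlib import Path


def pdf_to_text(path):
    """Return the text of a PDF using pdfminer.six's high-level helper."""
    pdf_path = Path(path)
    if not pdf_path.is_file():
        raise FileNotFoundError(f"No such PDF: {pdf_path}")

    # Imported lazily so the path check works even without pdfminer.six
    from pdfminer.high_level import extract_text  # pip install pdfminer.six

    return extract_text(str(pdf_path))
```

The longer version above remains useful when the layout parameters, page selection, or password need to be controlled explicitly.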

score 0 · Answer 7 · answered Apr 25 '19 at 13:53

import pdftables_api
import os

c = pdftables_api.Client('MY-API-KEY')

file_path = "C:\\Users\\MyName\\Documents\\PDFTablesCode\\"

for file in os.listdir(file_path):
    if file.endswith(".pdf"):
        c.xlsx(os.path.join(file_path,file), file+'.xlsx')

Go to https://pdftables.com to get an API key.

Supported output formats:

CSV: format=csv
XML: format=xml
HTML: format=html
XLSX: format=xlsx-single, format=xlsx-multiple

score 0 · Answer 8 · answered Dec 19 '19 at 18:56

Try pdfreader. You can extract either plain text or decoded text containing "pdf markdown":

from pdfreader import SimplePDFViewer, PageDoesNotExist

plain_text = ""
pdf_markdown = ""

# your_pdf_file_name is a placeholder for the path to your PDF
with open(your_pdf_file_name, "rb") as fd:
    viewer = SimplePDFViewer(fd)
    try:
        while True:
            viewer.render()
            pdf_markdown += viewer.canvas.text_content
            plain_text += "".join(viewer.canvas.strings)
            viewer.next()  # raises PageDoesNotExist after the last page
    except PageDoesNotExist:
        pass

score 0 · Answer 9 · edited May 22 '22 at 20:20

Try this:

In a terminal, run: pip install PyPDF2

import PyPDF2

reader = PyPDF2.PdfReader("mypdf.pdf")
for page in reader.pages:
    print(page.extract_text())

edited May 22 '22 at 20:20

Martin Thoma

answered Aug 01 '20 at 13:44

mamal

score 0 · Answer 10 · answered Feb 28 '21 at 15:16

I think this code will be exactly what you are looking for:

import glob
import pdfplumber

for filename in glob.glob("*.pdf"):
    output_file = filename.replace('.pdf', '.txt')
    # Context managers close both files even if extraction fails
    with pdfplumber.open(filename) as pdf, open(output_file, "a+") as out:
        for page in pdf.pages:
            # extract_text() may return None for pages without text
            text = page.extract_text() or ""
            print(text)
            out.write(text)
