Reading pdf files line by line using python

Question

I used the following code to read a PDF file, but it returns no content. What could be the reason?

from PyPDF2 import PdfFileReader

reader = PdfFileReader("example.pdf")
contents = reader.pages[0].extractText().split("\n")
print(contents)

The output is [u''] rather than the text of the page.
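
For context on where that output comes from: splitting an empty string on newlines still yields a list with one empty element, so the result suggests that extractText() returned an empty string for page 0. A minimal sketch, where `extracted` is only a stand-in for that return value:

```python
# Stand-in for the value extractText() appears to have returned for page 0.
extracted = ""

# Splitting an empty string still yields a list with one (empty) element.
contents = extracted.split("\n")
print(contents)
```

So the list itself is not the problem; the question is why the extracted text was empty.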

Does it work for other page numbers than 0? Are you sure there is text in the PDF, and not just images or graphics? — mkrieger1, Jul 12 '17 at 14:36

score 5 · Answer 1 · edited May 14 '22 at 12:02

import re
from PyPDF2 import PdfFileReader

reader = PdfFileReader("example.pdf")

for page in reader.pages:
    text = page.extractText()
    text_lower = text.lower()
    # Iterate over lines; looping over the string itself would yield single characters
    for line in text_lower.splitlines():
        if re.search("abc", line):
            print(line)

I use this to iterate over a PDF page by page, search each line for key terms, and process the matches further.
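
The same idea can be separated from the PDF library so that it is easy to try out. This is only a sketch: `find_matching_lines` and `sample_pages` are hypothetical names, and the sample strings stand in for whatever extractText() returns for each page.

```python
import re


def find_matching_lines(page_texts, pattern):
    """Return (page_number, line) pairs for lines matching pattern, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    matches = []
    for page_number, text in enumerate(page_texts, start=1):
        for line in text.splitlines():
            if regex.search(line):
                matches.append((page_number, line))
    return matches


# Hypothetical page texts standing in for the output of extractText().
sample_pages = ["Intro\nThe ABC method", "No match here", "abc again\nend"]
hits = find_matching_lines(sample_pages, "abc")
for page_number, line in hits:
    print(page_number, line)
```

With a real file, the page texts could be supplied as a generator such as (page.extractText() for page in reader.pages).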

answered Jan 23 '18 at 12:47 by Piyush Rumao · edited May 14 '22 at 12:02 by Martin Thoma

score 0 · Answer 2 · answered Jul 08 '17 at 04:16

Maybe this can help you read the PDF.

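import pyPdf

def getPDFContent(path):
    content = ""
    # open() replaces the Python 2-only file() builtin
    with open(path, "rb") as p:
        pdf_content = pyPdf.PdfFileReader(p)
        # Use the real page count instead of a hard-coded 10
        for i in range(pdf_content.getNumPages()):
            content += pdf_content.getPage(i).extractText() + "\n"
    # Replace non-breaking spaces and collapse all whitespace runs
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content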
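
The last step of that function is worth seeing in isolation, because it also removes every line break. A small sketch, assuming a hypothetical helper name `normalize_whitespace` and a made-up input string:

```python
def normalize_whitespace(text):
    """Collapse runs of whitespace, including non-breaking spaces, to single spaces."""
    return " ".join(text.replace(u"\xa0", " ").split())


# Made-up text with a non-breaking space, blank lines, and repeated spaces.
raw = u"First\xa0line \n\n  second   line\n"
cleaned = normalize_whitespace(raw)
print(cleaned)
```

The result is one long line, so this kind of cleanup is probably not what you want if the goal is to read the PDF line by line.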

score 0 · Answer 3 · answered Oct 03 '17 at 17:04

I think you need to specify the drive letter, which is missing from your path. For example, "D:/Users/Rahul/Desktop/Dfiles/106_2015_34-76357.pdf". I tried this and could read the file without any problem.
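
Before opening the file, the path can be checked directly. Note that a wrong path would normally raise an error when the file is opened rather than produce empty output, but the check is cheap. This is a sketch only; `describe_pdf_path` is a hypothetical helper and "example.pdf" is a placeholder name.

```python
import os


def describe_pdf_path(path):
    """Return a short diagnosis of whether path points to a non-empty file."""
    full_path = os.path.abspath(path)
    if not os.path.isfile(full_path):
        return "missing: " + full_path
    if os.path.getsize(full_path) == 0:
        return "empty: " + full_path
    return "ok: " + full_path


# Placeholder file name; replace it with the path you pass to PdfFileReader.
print(describe_pdf_path("example.pdf"))
```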

Or, if you want to find the file path using the os module (which your code did not actually connect to the path), you can try the following:

from PyPDF2 import PdfFileReader
import os

def find(name, path):
    for root, dirs, files in os.walk(path):
        if name in files:
            return os.path.join(root, name)

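# find() returns None when the file is not found under the given folder
pdf_path = find('106_2015_34-76357.pdf', 'D:/Users/Rahul/Desktop/Dfiles/')

# The with block closes the file even if extraction fails
with open(pdf_path, 'rb') as f:
    reader = PdfFileReader(f)
    contents = reader.getPage(0).extractText().split('\n')

print(contents)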

The find function comes from Nadia Alramli's answer to the question "Find a file in python".

Anush · Answer 4 · 2019-12-21T12:35:20.920

To read files from multiple folders under a directory, the code below can be used. This example reads PDF files:

import os
from tika import parser

path = "/usr/local/"  # root directory to search
for r, d, f in os.walk(path):  # walk through all subdirectories
    for file in f:
        if file.lower().endswith(".pdf"):  # read only PDF files
            file_join = os.path.join(r, file)  # build the full path
            file_data = parser.from_file(file_join)  # parse the PDF file
            text = file_data['content']  # read the content
            print(text)  # print the content
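
The directory walk can also be kept apart from the parsing step, which makes it easier to check which files would be read. A sketch using only the standard library; `list_pdf_files` is a hypothetical helper name, and "." is just a placeholder root folder.

```python
import os


def list_pdf_files(root_dir):
    """Return sorted full paths of files under root_dir whose names end in .pdf."""
    pdf_paths = []
    for folder, _subdirs, filenames in os.walk(root_dir):
        for filename in filenames:
            # Compare in lower case so that "REPORT.PDF" is found as well.
            if filename.lower().endswith(".pdf"):
                pdf_paths.append(os.path.join(folder, filename))
    return sorted(pdf_paths)


# Placeholder root folder; each path could then be passed to a PDF parser.
for pdf_path in list_pdf_files("."):
    print(pdf_path)
```

Checking the end of the name avoids picking up files such as "notes.pdf.txt", which a plain substring test would match.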

It's overly complex to show a directory walk. Answer the question asked as well. — Kickaha, Dec 21 '19 at 12:04

score 0 · Answer 5 · answered Jan 27 '21 at 09:46

from PyPDF2 import PdfFileReader
from nltk import sent_tokenize

def getTextPDF(pdfFileName, password=''):
    """Extract text from a PDF and return it as a list of sentences."""
    with open(pdfFileName, 'rb') as pdf_file:
        read_pdf = PdfFileReader(pdf_file)
        if password != '':
            read_pdf.decrypt(password)
        text = []
        for i in range(read_pdf.getNumPages()):
            text.append(read_pdf.getPage(i).extractText())
    # All line breaks are removed before the text is split into sentences
    text = '\n'.join(text).replace("\n", '')
    return sent_tokenize(text)

score 0 · Answer 6 · answered May 14 '22 at 11:59

The issue was one of two things: (1) The text was not on page one - hence a user error. (2) PyPDF2 failed to extract the text - hence a bug in PyPDF2.

Sadly, the second one still happens for some PDFs.
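
To tell the two cases apart, it can help to look at every page instead of only the first. The following is a rough sketch, not a definitive test: `pages_without_text`, `diagnose`, and `sample` are hypothetical names, and the sample strings stand in for the text extracted from each page.

```python
def pages_without_text(page_texts):
    """Return 1-based numbers of pages whose extracted text is empty or whitespace."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not text.strip()
    ]


def diagnose(page_texts):
    """Summarize which pages yielded no text."""
    empty = pages_without_text(page_texts)
    if not empty:
        return "every page yielded text"
    if len(empty) == len(page_texts):
        return "no page yielded text (scanned images or an extraction failure?)"
    return "pages without text: " + ", ".join(str(n) for n in empty)


# Hypothetical extraction results for a three-page document.
sample = ["", "Chapter 1\nSome text", "   "]
print(diagnose(sample))
```

If only the first page is empty, the text is simply elsewhere in the document. If no page yields anything, the PDF may contain only images, or the extraction itself may have failed.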

answered by Martin Thoma

Mayur Vora · Answer 7 · 2017-07-12T10:23:18.927

Hello Rahul Pipalia,

If PyPDF2 is not installed in your Python environment, install it first and then use the module.

Installation Steps for Ubuntu (Install python-pypdf)

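First, open a terminal.
Then type sudo apt-get install python-pypdf
If that package does not provide the PyPDF2 module on your system, pip install PyPDF2 installs it from PyPI.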

Solution to Your Problem

Try the code below:

# Import the library
import PyPDF2

# Name of the file to read, including the ".pdf" extension; open it in binary mode
pdf_file = open('Your_Pdf_File_Name.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()

# Zero-based index of the page to read; must be less than number_of_pages
page_number = 0
page = read_pdf.getPage(page_number)

page_content = page.extractText()

# Display the content of the page
print(page_content)
pdf_file.close()
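
Because getPage() counts pages from zero, a page number as shown in a PDF viewer has to be reduced by one. A small sketch of that conversion with a range check; `page_index` is a hypothetical helper, not part of PyPDF2.

```python
def page_index(page_number, number_of_pages):
    """Convert a 1-based page number to the zero-based index getPage() expects."""
    if not 1 <= page_number <= number_of_pages:
        raise ValueError(
            "page %d is outside 1..%d" % (page_number, number_of_pages)
        )
    return page_number - 1


# The first page of a five-page document has index 0.
print(page_index(1, 5))
```

The returned value could then be passed on, for example as read_pdf.getPage(page_index(1, number_of_pages)).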

Download the PDF from the link below and try this code: https://www.dropbox.com/s/4qad66r2361hvmu/sample.pdf?dl=1

I hope my answer is helpful.
If you have any questions, please leave a comment.
