6

I used the following code to read a PDF file, but it does not return any of the text. What could be the reason?

from PyPDF2 import PdfFileReader

reader = PdfFileReader("example.pdf")
contents = reader.pages[0].extractText().split("\n")
print(contents)

The output is [u''] instead of the content of the page.

Martin Thoma
Rahul Pipalia
  • Does it work for other page numbers than 0? Are you sure there is text in the PDF, and not just images or graphics? – mkrieger1 Jul 12 '17 at 14:36

7 Answers

5
import re
from PyPDF2 import PdfFileReader

reader = PdfFileReader("example.pdf")

for page in reader.pages:
    text = page.extractText()
    text_lower = text.lower()
    # Split into lines; iterating over the string itself would yield single characters.
    for line in text_lower.splitlines():
        if re.search("abc", line):
            print(line)

I use this to go through the PDF page by page, search each page for key terms, and process the matches further.
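The per-line search can be tried without a PDF at hand. The following is only a minimal sketch: the helper name find_matching_lines and the sample text are made up for illustration, and the string stands in for whatever extractText() returns for one page.

```python
import re

def find_matching_lines(page_text, pattern):
    """Return the lower-cased lines of page_text that match pattern."""
    matches = []
    # splitlines() yields whole lines; looping over the string would yield characters.
    for line in page_text.lower().splitlines():
        if re.search(pattern, line):
            matches.append(line)
    return matches

# Hypothetical page text standing in for page.extractText()
sample_page = "Intro\nThe ABC method\nNothing here"
print(find_matching_lines(sample_page, "abc"))  # ['the abc method']
```

Inside the page loop, the same helper could be called as find_matching_lines(page.extractText(), "abc").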

Martin Thoma
Piyush Rumao
0

Maybe this can help you read the PDF.

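import pyPdf

def getPDFContent(path):
    content = ""
    # Open in binary mode; file() exists only in Python 2, open() works in both.
    with open(path, "rb") as p:
        pdf_content = pyPdf.PdfFileReader(p)
        # Read every page instead of assuming the document has exactly 10 pages.
        for i in range(pdf_content.getNumPages()):
            content += pdf_content.getPage(i).extractText() + "\n"
    # Replace non-breaking spaces and collapse runs of whitespace.
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content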
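The last step of getPDFContent, the whitespace cleanup, is easy to check in isolation. This is a small sketch of that one expression; the function name normalize_whitespace and the sample string are only illustrative.

```python
def normalize_whitespace(text):
    """Replace non-breaking spaces and collapse all whitespace runs to single spaces."""
    return " ".join(text.replace(u"\xa0", " ").strip().split())

# Hypothetical extracted text with a non-breaking space and stray line breaks
raw = u"a\xa0 b\n\nc "
print(normalize_whitespace(raw))  # a b c
```

Note that this also removes line breaks, so the result is one long line of text.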
Tejas Thakar
0

I think you need to specify the drive letter; it is missing from your path. For example: "D:/Users/Rahul/Desktop/Dfiles/106_2015_34-76357.pdf". I tried this and could read the file without any problem.

Or, if you want to look up the file path with the os module, which your code did not actually tie to your directory, you can try the following:

from PyPDF2 import PdfFileReader
import os

def find(name, path):
    for root, dirs, files in os.walk(path):
        if name in files:
            return os.path.join(root, name)

directory = find('106_2015_34-76357.pdf', 'D:/Users/Rahul/Desktop/Dfiles/')

f = open(directory, 'rb')

reader = PdfFileReader(f)

contents = reader.getPage(0).extractText().split('\n')

f.close()

print(contents)

The find function comes from Nadia Alramli's answer to "Find a file in python".
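One caveat with find as written: when the file is not found it returns None, and the later open() call then fails with a less helpful error. A possible variant, sketched here under the hypothetical name find_or_raise, reports the missing file directly. The demonstration uses a temporary directory rather than a real PDF folder.

```python
import os
import tempfile

def find_or_raise(name, path):
    """Like find(), but raise instead of silently returning None."""
    for root, dirs, files in os.walk(path):
        if name in files:
            return os.path.join(root, name)
    raise FileNotFoundError("%s not found under %s" % (name, path))

# Demonstration on a throwaway directory tree (not a real PDF).
with tempfile.TemporaryDirectory() as tmp:
    nested = os.path.join(tmp, "a", "b")
    os.makedirs(nested)
    target = os.path.join(nested, "example.pdf")
    open(target, "wb").close()
    found = find_or_raise("example.pdf", tmp)
    print(os.path.samefile(found, target))  # True
```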

Ahaha
0

To read files from multiple folders under a directory, the code below can be used. This example reads PDF files:

import os
from tika import parser

path = "/usr/local/"  # root directory to search
for r, d, f in os.walk(path):  # walk through all subdirectories
    for file in f:
        if file.lower().endswith(".pdf"):  # process only PDF files
            file_join = os.path.join(r, file)  # build the full path
            file_data = parser.from_file(file_join)  # parse the PDF file
            text = file_data['content']  # extracted text content
            print(text)  # print the content
Anush
0
def getTextPDF(pdfFileName, password=''):
    """Extract text from a PDF and return it as a list of sentences."""
    import PyPDF2
    from nltk import sent_tokenize
    text = []
    # The with block closes the file once all pages have been read.
    with open(pdfFileName, 'rb') as pdf_file:
        read_pdf = PyPDF2.PdfFileReader(pdf_file)
        if password != '':
            read_pdf.decrypt(password)
        for i in range(read_pdf.getNumPages()):
            text.append(read_pdf.getPage(i).extractText())
    # Concatenate the pages and drop line breaks before tokenizing.
    text = ''.join(text).replace("\n", '')
    text = sent_tokenize(text)
    return text
thrinadhn
0

The issue was one of two things: (1) The text was not on page one - hence a user error. (2) PyPDF2 failed to extract the text - hence a bug in PyPDF2.

Sadly, the second one still happens for some PDFs.
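To tell the two cases apart, it can help to look at every page rather than only page 0. The following is a rough sketch, not a definitive diagnosis: the helper names pages_with_text and extract_page_texts are made up here, and the extraction helper assumes the same legacy PyPDF2 API (PdfFileReader, extractText) used in the question.

```python
def pages_with_text(page_texts):
    """Return the indices of pages whose extracted text is not blank."""
    return [i for i, text in enumerate(page_texts) if text and text.strip()]

def extract_page_texts(path):
    """Extract the text of every page (assumes the legacy PyPDF2 API from the question)."""
    from PyPDF2 import PdfFileReader  # imported here so pages_with_text has no dependency
    with open(path, "rb") as f:
        reader = PdfFileReader(f)
        return [page.extractText() for page in reader.pages]

# Hypothetical extraction result: blank first two pages, text on the third.
sample = ["", "  \n", "Chapter 1"]
print(pages_with_text(sample))  # [2]
```

With a real file, pages_with_text(extract_page_texts("example.pdf")) lists the pages that yielded text. An empty list suggests the PDF contains only images or that extraction failed for this document.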

Martin Thoma
-2

Hello Rahul Pipalia,

If PyPDF2 is not installed in your Python environment, install it first and then use the module.

Installation steps for Ubuntu (install python-pypdf2, the package that provides the PyPDF2 module imported below)

  1. First, open a terminal
  2. Then type sudo apt-get install python-pypdf2

Solution to your problem

Try the code below:

# Import the library
import PyPDF2

# Open the PDF in binary mode; give the file name including the ".pdf" extension
pdf_file = open('Your_Pdf_File_Name.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()

# Index of the page to read. getPage() counts from 0, so 0 is the first page
# and number_of_pages - 1 is the last.
page_number = 0
page = read_pdf.getPage(page_number)

page_content = page.extractText()

# Display the content of the page
print(page_content)
pdf_file.close()
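Because getPage() takes a 0-based index while people usually think in 1-based page numbers, a small conversion helper can avoid off-by-one mistakes. This is only an illustrative sketch; page_index is a hypothetical name, not part of PyPDF2.

```python
def page_index(page_number, number_of_pages):
    """Convert a 1-based page number to the 0-based index getPage() expects."""
    if not 1 <= page_number <= number_of_pages:
        raise ValueError("page %d is outside 1..%d" % (page_number, number_of_pages))
    return page_number - 1

print(page_index(1, 3))  # 0, the first page of a three-page document
```

It could be used as read_pdf.getPage(page_index(1, number_of_pages)) to read the first page.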

Download the PDF from the link below and try this code: https://www.dropbox.com/s/4qad66r2361hvmu/sample.pdf?dl=1

I hope my answer is helpful.
If you have any questions, please leave a comment.

Mayur Vora