I have a problem using PyPDF3 to count how many times a specific word appears in a PDF file.
My code finds the word, but it counts it only once per page, so the maximum result is the number of pages. The word "the" should give around 700, but the result is only 30 (the PDF has 30 pages).
import PyPDF3
import re

def read_pdf(file,string):
    fils = file.split(".")
    print(fils[1])
    word = string
    if fils[1] == "pdf":
        pdfFileObj = open(file,"rb")
        # open the pdf file
        object = PyPDF3.PdfFileReader(file)
        # get number of pages
        NumPages = object.getNumPages()
        # define keyterms
        counter = 0
        # extract text and do the search
        for i in range(NumPages):
            PageObj = object.getPage(i)
            print("page " + str(i))
            Text = PageObj.extractText()
            #print(Text)
            if word in Text:
                print("The word is on this page")
                counter += 1
        print(word, "exists", counter, "times in the file")
Can you guys see what I have done wrong and help me fix it?
Thanks :)
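
A note on the likely cause, as a rough sketch rather than a definitive fix: `if word in Text` only tests whether the word is present on the page, so `counter` can grow by at most 1 per page. Counting every match on each page should give the real total. The helper below is hypothetical (the name `count_word` and the `whole_word` flag are not part of PyPDF3), and it works on plain strings so it can be run without a PDF; the sample `pages` list stands in for the text that `extractText()` would return for each page.

```python
import re


def count_word(page_texts, word, whole_word=True):
    """Count every occurrence of `word` across an iterable of page strings."""
    if whole_word:
        # \b keeps "the" from matching inside "other" or "then".
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    else:
        # Plain case-insensitive substring matching.
        pattern = re.compile(re.escape(word), re.IGNORECASE)
    total = 0
    for text in page_texts:
        # findall returns one entry per match, so its length is the page count.
        total += len(pattern.findall(text))
    return total


# Stand-in page texts; with PyPDF3 these would come from extractText().
pages = ["The cat and the other cat.", "Then the dog barked."]
print(count_word(pages, "the"))                    # whole words only
print(count_word(pages, "the", whole_word=False))  # substrings too
```

In the loop from the question, the equivalent change would be to replace the `if word in Text:` block with something like `counter += Text.count(word)` (case-sensitive substring count) or `counter += len(re.findall(...))` with a pattern like the one above. The total may still differ from what a PDF viewer reports if the text extraction splits or merges words.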