
I have a problem using PyPDF3 to count how many times a specific word occurs in a PDF file.

My code finds the word, but it counts it only once per page, so the maximum is the number of pages. The word "the" should give around 700, but the result is only 30 (the PDF has 30 pages).

import PyPDF3
import re
def read_pdf(file,string):
    fils = file.split(".")
    print(fils[1])
    word = string
    if fils[1] == "pdf":
        # open the pdf file
        pdfFileObj = open(file,"rb")
        object = PyPDF3.PdfFileReader(file)
        # get number of pages
        NumPages = object.getNumPages()

        # define keyterms
        counter = 0
        # extract text and do the search
        for i in range(NumPages):
            PageObj = object.getPage(i)
            print("page " + str(i))
            Text = PageObj.extractText()
            #print(Text)
            if word in Text:
                print("The word is on this page")
                counter += 1
        print(word, "exists", counter, "times in the file")

Can you see what I have done wrong and help me fix it?

Thanks :)

Underoos
VTP
  • You read the text of one page into `Text` and check if `word in Text` - so if the word is `the` and `Text = "the the the the the the the the the "`, you add `1`. You need to _count_ how often `'the'` is in `Text` and add that counted amount, not `1`. – Patrick Artner Mar 01 '19 at 12:11
  • [python-finding-word-frequencies-of-list-of-words-in-text-file](https://stackoverflow.com/questions/14921436/python-finding-word-frequencies-of-list-of-words-in-text-file) - this handles the more complex case of finding word counts for a list of words, but you can simplify the given answers to fit your need – Patrick Artner Mar 01 '19 at 12:14
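
The point made in the comments can be tried out without a PDF at hand. Below is a minimal sketch, assuming the text of each page has already been extracted (for example with `PageObj.extractText()` as in the question). The names `count_occurrences` and `page_texts` are hypothetical and used only for illustration.

```python
import re

def count_occurrences(page_texts, word, whole_word=False):
    """Sum the occurrences of `word` over all pages (hypothetical helper)."""
    total = 0
    for text in page_texts:
        if whole_word:
            # \b marks a word boundary, so "the" does not match inside "other"
            pattern = r"\b" + re.escape(word) + r"\b"
            total += len(re.findall(pattern, text))
        else:
            # str.count returns the number of non-overlapping substring matches
            total += text.count(word)
    return total

# Stand-in for text extracted page by page
page_texts = ["the the the", "the other theme", "no match here"]
print(count_occurrences(page_texts, "the"))                   # 6
print(count_occurrences(page_texts, "the", whole_word=True))  # 4
```

Note that plain substring counting also picks up "the" inside "other" and "theme", which is why the two results differ. Both variants are case-sensitive as written.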

1 Answer


What you need to do is collect ALL the words from ALL pages into a single list.
Once you have that list, you can pass it to `Counter`, which gives you each word and the number of times it occurs in the PDF.

Example:

from collections import Counter

pdf_words = ['the','fox','the','jack']

counter = Counter(pdf_words)
print(counter)

Output:

Counter({'the': 2, 'fox': 1, 'jack': 1})
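
To connect this to text coming out of a PDF, here is a rough sketch that splits the text of each page into words and feeds them all into one `Counter`. It assumes the page texts were already extracted, as in the question's loop. The helper name `count_words`, the lower-casing, and the regex-based word splitting are illustrative assumptions, not something the example above prescribes.

```python
import re
from collections import Counter

def count_words(page_texts):
    """Count every word across all pages (hypothetical helper)."""
    counter = Counter()
    for text in page_texts:
        # Lower-case first so "The" and "the" are counted together
        words = re.findall(r"[a-z']+", text.lower())
        counter.update(words)
    return counter

# Stand-in for text extracted page by page
page_texts = ["The fox saw the dog.", "Jack and the fox."]
counter = count_words(page_texts)
print(counter["the"])  # 3
print(counter["fox"])  # 2
```

A `Counter` returns 0 for a word it has never seen, so looking up a missing word does not raise a `KeyError`.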
balderman