I am trying to parse a PDF document and extract the values that appear against certain keywords, and I am doing it step by step. Below is the code I have come up with so far, where I am trying to create a list of words that match the keywords. However, in the output I get, a complete line is treated as a single word.

import PyPDF2
from nltk.tokenize import word_tokenize

# Read every page of the PDF and concatenate the extracted text.
with open('C:\\mydoc.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    text = ""
    for count in range(pdfReader.numPages):
        text += pdfReader.getPage(count).extractText()

# Tokenize the extracted text into words.
keywords = word_tokenize(text, 'english', False)
print(keywords)

The output is:

['OfficeNotice', 'AbcBANKLTD', 'ThisOfferingNoticerelatestothe']

My expected output is:

['Office', 'Notice', 'Abc', 'BANK', 'LTD', 'This', 'Offering', 'Notice', 'relates', 'to', 'the']

1 Answer

As the documentation of the extractText method you are using says:

Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.

In short: the method is implemented in a simplistic way, which is likely why the words in your output come out joined together. It won't get you what you want unless you are willing to dive into the PDF structure documentation and improve the library. It would be a worthy gift to the community should you decide to do it.

Should you want to explore the alternatives, check a very similar question.
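If switching extractors is not an option, one possible workaround is to post-process the fused tokens yourself. The sketch below is not part of PyPDF2 or NLTK: `segment` is a hypothetical helper that splits a token into known words with a small dynamic program, and `VOCABULARY` is a made-up word list chosen only to cover the sample output from the question. In practice you would need a much larger word list, and tokens with unknown words would stay fused.

```python
from typing import List, Optional, Set


def segment(token: str, vocabulary: Set[str]) -> Optional[List[str]]:
    """Split a fused token into vocabulary words, or return None if impossible."""
    lowered = token.lower()
    max_len = max(map(len, vocabulary), default=0)
    # best[i] holds the shortest list of words covering token[:i], if any.
    best: List[Optional[List[str]]] = [None] * (len(token) + 1)
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(max(0, end - max_len), end):
            prefix = best[start]
            if prefix is None or lowered[start:end] not in vocabulary:
                continue
            # Keep the original casing of the matched slice.
            candidate = prefix + [token[start:end]]
            if best[end] is None or len(candidate) < len(best[end]):
                best[end] = candidate
    return best[len(token)]


# Hypothetical vocabulary (an assumption); a real one would come from a word list.
VOCABULARY = {"office", "notice", "abc", "bank", "ltd",
              "this", "offering", "relates", "to", "the"}

fused = ['OfficeNotice', 'AbcBANKLTD', 'ThisOfferingNoticerelatestothe']
words = []
for token in fused:
    pieces = segment(token, VOCABULARY)
    # Leave the token unchanged when it cannot be segmented.
    words.extend(pieces if pieces is not None else [token])
print(words)
```

Matching is case-insensitive but the original casing is preserved, and when several splits are possible the one with the fewest words wins. That tie-breaking rule is a design choice for this sketch, not something the libraries prescribe.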

sophros