I am trying to parse a PDF document and extract the values that appear against certain keywords, and I am doing it step by step. Below is the code I have come up with so far, in which I am trying to create a list of words to match against the keywords. However, in the output I get, a complete line is treated as a single word.
import PyPDF2,nltk
import easytextract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
pdfFileObj = open('C:\\mydoc.pdf','rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
count = 0
text = ""
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()
keywords = [word for word in word_tokenize(text,'english',False)]
print(keywords)
Output:
['OfficeNotice', 'AbcBANKLTD', 'ThisOfferingNoticerelatestothe']
My expected output is as below:
['Office','Notice','Abc','BANK','LTD','This','Offering','Notice','relates','to','the']
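To make the gap between the two outputs concrete, here is a minimal sketch of one way the run-together tokens could be split. It is not an NLTK or PyPDF2 feature. It assumes the spaces were lost during text extraction and that a word list is available. The names `VOCAB`, `split_on_case`, `segment`, and `split_token` are hypothetical, and the tiny vocabulary is hard-coded only to cover the sample tokens above.

```python
import re

# Hypothetical vocabulary for illustration only; a real run would need
# a much larger word list (assumption, not part of the original code).
VOCAB = {"office", "notice", "abc", "bank", "ltd", "this",
         "offering", "relates", "to", "the"}

def split_on_case(token):
    """Split at case boundaries, e.g. 'OfficeNotice' -> ['Office', 'Notice']."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", token)

def segment(token, vocab):
    """Split token into vocabulary words (case-insensitive lookup).

    Uses dynamic programming and prefers the split with the fewest pieces.
    Returns [token] unchanged when no full split exists.
    """
    lower = token.lower()
    n = len(token)
    best = [None] * (n + 1)  # best[i] holds the pieces covering token[:i]
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and lower[j:i] in vocab:
                candidate = best[j] + [token[j:i]]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[n] if best[n] is not None else [token]

def split_token(token, vocab=VOCAB):
    """Apply the case split first, then the vocabulary split to each piece."""
    words = []
    for piece in split_on_case(token):
        words.extend(segment(piece, vocab))
    return words

tokens = ['OfficeNotice', 'AbcBANKLTD', 'ThisOfferingNoticerelatestothe']
keywords = [word for token in tokens for word in split_token(token)]
print(keywords)
```

The case split alone handles 'OfficeNotice', but pieces such as 'BANKLTD' and 'Noticerelatestothe' have no case boundary to split on, so they need the vocabulary lookup. Any word missing from the vocabulary leaves its piece unsplit.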