Extracting text from a PDF and comparing it to a dictionary

Question

I am currently working on a project where I want to extract text from a PDF and then check whether any of the words in the extracted text appears in a certain dictionary. If so, I want to use example.replace(file, x, y) to replace the word in my text with the value from my dictionary.

I'm struggling with the loop that checks all words in my text and compares them to the dictionary automatically. The goal is that I don't have to type "old" and "new" myself: the program should check all words in the text and, if it finds one in the dictionary, "old" should be the word from the text and "new" the value for that key. The manual version works.

Here is my code

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path):

    rsrcmgr = PDFResourceManager()

    retstr = StringIO()
    codec = 'utf-8'

    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

dictionary = {"Die" : "Der", "Arbeitsfläche":"Platz"}


def convert(file, old, new):

    translation = convert_pdf_to_txt(file).replace(old, new)
    return translation

print(convert('mytest.pdf','Die' ,'Der'))

Thanks for help!

Naga kiran · Answer 1 · 2018-09-05T09:16:09.740

If your intention is just to replace words in the text extracted from the PDF with the dictionary values, this solution might help. Pick out the words that intersect with the dictionary keys and replace them one by one.

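import re
# text = text extracted from the PDF; dictionary is the dict from the question
text = r" with the loop for Die checking all words in my text and compare them to the dictionary automatically"
# Only loop over dictionary keys that actually occur as words in the text
for key in set(text.split()).intersection(dictionary.keys()):
    # \b restricts the match to whole words; re.escape keeps the key literal
    text = re.sub(r"\b" + re.escape(key) + r"\b", dictionary[key], text)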

score 1 · Accepted Answer · answered Sep 05 '18 at 11:24


Assuming you're able to read the PDF file, you can store the words in a list using

pdf_vocab = []

pdf_vocab.extend(text.split())

Now, using a simple loop, you can check whether each element of the list is a key in the dictionary and, if it is, replace it.

# enumerate yields the index (indx) together with each word
for indx, word in enumerate(pdf_vocab):
    if word in dictionary:
        pdf_vocab[indx] = dictionary[word]

The indx variable stores the current index in the list; whenever the element (a word) is in the dictionary, we replace the word at that index.
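Put together, the approach above could look roughly like the sketch below. The helper name replace_words is made up for illustration, and it assumes words are separated by whitespace, so a word with punctuation attached (such as "Die,") will not match the key "Die".

```python
def replace_words(text, dictionary):
    """Replace every whitespace-separated word that is a dictionary key."""
    pdf_vocab = text.split()
    for indx, word in enumerate(pdf_vocab):
        if word in dictionary:
            pdf_vocab[indx] = dictionary[word]
    # Joining with single spaces discards the original line breaks and spacing.
    return " ".join(pdf_vocab)


dictionary = {"Die": "Der", "Arbeitsfläche": "Platz"}
result = replace_words("Die Arbeitsfläche ist leer", dictionary)
print(result)
```

To use it with the code from the question, you would pass convert_pdf_to_txt('mytest.pdf') as the text argument.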

answered Sep 05 '18 at 11:24

Sarthak Gupta


Hey, thanks for your answer. This solution is what I was looking for. I implemented it in my last function ("convert(...)") but unfortunately the code does not find any similar words in text and dictionary – Agostino Sep 09 '18 at 16:10
I am sorry for the late reply. The solution above is case-sensitive (upper vs. lower case); if you don't want that, you can compare lower-cased words, e.g. i.lower() in {k.lower() for k in dictionary}. Otherwise it works fine for me. – Sarthak Gupta Sep 12 '18 at 07:57

score 0 · Answer 3 · answered Sep 05 '18 at 08:28

Since I am not allowed to comment...

This loop should help you.

for old, new in dictionary.items():
    # update text by replacing old with new
    text = text.replace(old, new)

When replacing, make sure that only whole words are exchanged; otherwise 'book': 'shoe' could transform the word 'bookmarket' into 'shoemarket'. The re module can help you here: https://docs.python.org/3/library/re.html
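As a rough sketch of the whole-word idea with re: the function name replace_whole_words is hypothetical, and it assumes every key starts and ends with a word character, since \b only matches at word boundaries.

```python
import re


def replace_whole_words(text, dictionary):
    """Replace dictionary keys only where they occur as whole words."""
    for old, new in dictionary.items():
        # \b matches a word boundary; re.escape keeps special characters literal.
        pattern = r"\b" + re.escape(old) + r"\b"
        # A function replacement inserts `new` literally (no backslash processing).
        text = re.sub(pattern, lambda match, new=new: new, text)
    return text


result = replace_whole_words("a book at the bookmarket", {"book": "shoe"})
print(result)  # a shoe at the bookmarket
```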

Someone else has already had the same problem solved; see the question: Search and replace with "whole word only" option

If you also want to exchange phrases, the order of the dictionary may be important; the dictionary {'I': 'you', 'I like': 'chicken'} would transform 'I like' into 'you like', although this may not be wanted.
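One possible way around the ordering problem is to try longer keys first and replace everything in a single pass, so that replaced text is never replaced again. This is only a sketch: replace_longest_first is a made-up name, and it assumes a non-empty dictionary whose keys start and end with word characters.

```python
import re


def replace_longest_first(text, dictionary):
    """Replace keys in one pass, preferring longer keys over shorter ones."""
    # Sort keys by length so that 'I like' is tried before 'I'.
    keys = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")
    # A single pass over the text means a replacement is never replaced again.
    return pattern.sub(lambda match: dictionary[match.group(0)], text)


result = replace_longest_first("I like tea, I think", {"I": "you", "I like": "chicken"})
print(result)  # chicken tea, you think
```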
