I'm analysing a PDF, and after I extract its text in Python many of the words have stray spaces inside them or are missing the space between them. I'm using PdfReader from PyPDF2.
Examples:
Y ou’re sweet, but I feel fine.
I wish I feltas calmas you look.
The strange thing is that the spacing is correct in the PDF itself; the extra or missing spaces only appear once I extract the text in Python.
So my proposed solution is a grammar or spell-checking module that would take text like 'y ou' and turn it into 'you' (and 'asif' into 'as if'). Ideally there would be a way to enable only that spacing correction, because I don't want it to change anything else in the text.
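To make the idea concrete, here is a minimal sketch of a spacing-only repair. It assumes a small hypothetical word list (`VOCAB`) standing in for a real dictionary, and the helper names `key` and `repair_spacing` are made up for illustration. It joins two neighbouring tokens when only the joined form is a known word, and splits an unknown token when it is exactly two known words stuck together. Anything else is left untouched, although runs of whitespace are collapsed to single spaces.

```python
# Hypothetical, deliberately tiny word list; a real run would load a full dictionary.
VOCAB = {"you", "you're", "sweet", "but", "i", "feel", "fine",
         "wish", "felt", "as", "calm", "look", "if"}


def key(token):
    """Normalise a token for lookup: lowercase, straight apostrophe, no edge punctuation."""
    return token.lower().replace("\u2019", "'").strip(".,;:!?\"'()")


def repair_spacing(text, vocab=VOCAB):
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        # Stray space: join two tokens when the joined form is a known word
        # and at least one of the two pieces is not.
        if nxt is not None and key(tok + nxt) in vocab and (
            key(tok) not in vocab or key(nxt) not in vocab
        ):
            out.append(tok + nxt)
            i += 2
            continue
        # Missing space: split an unknown, purely alphabetic token into two known words.
        if tok.isalpha() and key(tok) not in vocab:
            for cut in range(1, len(tok)):
                if key(tok[:cut]) in vocab and key(tok[cut:]) in vocab:
                    tok = tok[:cut] + " " + tok[cut:]
                    break
        out.append(tok)
        i += 1
    return " ".join(out)


print(repair_spacing("Y ou\u2019re sweet, but I feel fine."))
print(repair_spacing("I wish I feltas calmas you look."))
```

With this toy word list the two example sentences come back with normal spacing. With a real dictionary the split step would need some ranking, since many strings can be cut into two valid words in more than one way.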
I'd also welcome other solutions, perhaps a change to the way I'm extracting the text from the PDF.
My current code looks like this:
from PyPDF2 import PdfReader

def all_pages1(num, start, stop):
    global file
    with open(f'example{num}.txt', 'w') as file:
        path = "C:/example.pdf"
        with open(path, mode='rb') as file2:
            reader = PdfReader(file2)
            for page in range(start, stop):
                page1 = reader.pages[page]
                # extract_text() is the current name; extractText() is deprecated
                text = page1.extract_text()
                main(num, text)
    # both with-blocks close their files, so no explicit close() calls are needed
main() does the actual searching, which isn't relevant to my problem.