Is there a Python module I can use to correct words that have random spaces in?

Question

I'm analysing a PDF, and for some reason many of the words have random spaces inside them, or no space between them, once I load the text into Python. I'm using PdfReader from PyPDF2.

Examples: Y ou’re sweet, but I feel fine. I wish I feltas calmas you look.

The strange thing is that the PDF itself looks fine: the extra (or missing) spaces only show up in the text after I extract it in Python.

So my proposed solution is a grammar or spellchecking module that will look at some text like 'y ou' and make it 'you' (and 'asif' to 'as if'). It would be great if there were a way to only enable that spellchecking feature, because I don't want it to change other things in the pdf.

I welcome any other solutions (perhaps in the way I'm collecting the pdf).

My current code looks like this:

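from PyPDF2 import PdfReader


def all_pages1(num, start, stop):
    # kept from the original: main() presumably writes to this module-level handle
    global file
    with open(f'example{num}.txt', 'w') as file:
        path = "C:/example.pdf"
        with open(path, mode='rb') as file2:
            reader = PdfReader(file2)
            for page in range(start, stop):
                page1 = reader.pages[page]
                # extract_text() replaces the deprecated extractText()
                text = page1.extract_text()
                main(num, text)
    # no explicit close() calls needed: each with block closes its file on exit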

main() does the actual searching that isn't relevant to my problem.

score 1 · Answer 1 · answered Mar 17 '23 at 20:42

disclaimer: I am the author of borb, the library used in this answer.

PDF is not a WYSIWYG (what you see is what you get) format.

If you open a webpage, you can expect to see <p> elements containing text exactly as it is rendered on the page (and conversely, exactly as you would expect to extract it).

In a PDF however, you will find rendering instructions. In pseudo-code, you would find something like:

go to 50, 600
set the active stroke color to black
set the active font to Helvetica, size 12
render the character 'H'
move right 14 dots
render the character 'e'
etc

Importantly, spaces can be realized simply by moving further to the right, rather than by actually rendering the character <space>.

Whenever a PDF library needs to extract text from a PDF, it will essentially loop over all rendering instructions and store them. It will then sort them in logical reading order (top to bottom, left to right).
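To make the sorting step concrete, here is a minimal sketch of what "logical reading order" could look like. The Glyph class, the reading_order function and the line_tolerance value are all hypothetical names made up for this illustration; real libraries (borb and PyPDF2 included) have their own, more involved logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    """A rendered character and where it was drawn (hypothetical structure)."""
    char: str
    x: float  # horizontal position, increasing to the right
    y: float  # vertical position, increasing upwards


def reading_order(glyphs: List[Glyph], line_tolerance: float = 2.0) -> List[Glyph]:
    """Sort glyphs top to bottom, then left to right within each line."""
    by_height = sorted(glyphs, key=lambda g: -g.y)
    lines: List[List[Glyph]] = []
    for g in by_height:
        # glyphs at (almost) the same height are assumed to share a line
        if lines and abs(lines[-1][0].y - g.y) <= line_tolerance:
            lines[-1].append(g)
        else:
            lines.append([g])
    return [g for line in lines for g in sorted(line, key=lambda g: g.x)]


# the glyphs arrive in drawing order, which need not be reading order
drawn = [Glyph("i", 64, 580), Glyph("e", 64, 600), Glyph("H", 50, 600), Glyph("h", 50, 580)]
ordered_text = "".join(g.char for g in reading_order(drawn))
print(ordered_text)
```

Note that nothing in this step knows about words or spaces yet; it only recovers an order for individual characters.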

Then it needs to determine whether to insert a space between the previously extracted text and the next character. To do so, it will ask the active font "how big is a space character?" and compare that to the distance between the previous character and the new one.

e.g.

'AB' : the horizontal distance is 5, the space width of Helvetica 12 is 120, the characters do not need a space between them
'A B' : the horizontal distance is 125, hence a space is inserted
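The comparison above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code of any particular library: the function name join_with_spaces, the (char, x_start, width) tuples and the 0.5 threshold are assumptions chosen for the example.

```python
from typing import List, Tuple


def join_with_spaces(
    glyphs: List[Tuple[str, float, float]],
    space_width: float,
    threshold: float = 0.5,  # assumed fraction of a space that counts as a gap
) -> str:
    """glyphs are (char, x_start, width) tuples, already sorted left to right."""
    out = []
    prev_end = None
    for char, x_start, width in glyphs:
        # insert a space when the gap to the previous glyph is large enough
        if prev_end is not None and (x_start - prev_end) >= threshold * space_width:
            out.append(" ")
        out.append(char)
        prev_end = x_start + width
    return "".join(out)


# the two cases from the example above (gap of 5 and gap of 125)
tight = join_with_spaces([("A", 0, 300), ("B", 305, 300)], space_width=120)
wide = join_with_spaces([("A", 0, 300), ("B", 425, 300)], space_width=120)

# the same glyphs, once with a sensible space width and once with a far too small one
word = [("Y", 0, 300), ("o", 315, 250), ("u", 565, 250)]
good = join_with_spaces(word, space_width=120)
bad = join_with_spaces(word, space_width=20)
print(repr(tight), repr(wide), repr(good), repr(bad))
```

The last two calls show how a wrong space width alone can turn a small kerning gap into a spurious space, which is the kind of output the question describes.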

Fonts are a mess in PDF, so I imagine the font in your PDF documents might simply be "broken", which forces text-extraction algorithms to "guess" the width of a space character.

There are various ways of doing this:

estimate the width based on the width of other characters
use a default
check whether the font happens to be monospaced
etc

All of these might be reasons why text-extraction is failing.
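As a rough sketch of how such guessing might be chained together: the function below tries the declared width first and then falls back on the strategies listed above. The order of the checks, the default of 250 and the factor 0.5 are assumptions made for this example, not values taken from borb or PyPDF2.

```python
from statistics import mean
from typing import Dict

DEFAULT_SPACE_WIDTH = 250.0  # assumed fallback value, arbitrary for this sketch


def guess_space_width(widths: Dict[str, float]) -> float:
    """widths maps characters to the glyph widths reported by the font."""
    # 1. trust the font if it reports a usable (non-zero) space width
    declared = widths.get(" ")
    if declared:
        return declared
    others = [w for c, w in widths.items() if c != " " and w > 0]
    # 2. nothing to go on: use a default
    if not others:
        return DEFAULT_SPACE_WIDTH
    # 3. monospaced font: every glyph is equally wide, so the space is too
    if len(others) > 1 and len(set(others)) == 1:
        return others[0]
    # 4. otherwise estimate from the other glyphs (half the mean is an assumption)
    return 0.5 * mean(others)


print(guess_space_width({" ": 278, "A": 667}))        # declared width is used
print(guess_space_width({"A": 600, "B": 600}))        # monospaced
print(guess_space_width({" ": 0, "A": 600, "i": 200}))  # estimated
print(guess_space_width({}))                          # default
```

Every fallback is a guess, so two libraries reading the same broken font can easily disagree about where the spaces go.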

You can try borb to see whether that fixes the problem.

#!chapter_005/src/snippet_005.py
import typing
from borb.pdf import Document
from borb.pdf import PDF
from borb.toolkit import SimpleTextExtraction


def main():

    # read the Document
    doc: typing.Optional[Document] = None
    l: SimpleTextExtraction = SimpleTextExtraction()
    with open("output.pdf", "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [l])

    # check whether we have read a Document
    assert doc is not None

    # print the text on the first Page
    print(l.get_text()[0])


if __name__ == "__main__":
    main()

Thanks, this is great information. Could you explain what the lines 'doc: typing' to 'simpletextextraction()' do? I'm a bit confused by the colons and what l does/is. — Rishi B, Mar 17 '23 at 21:40
To be clear, I'm just curious as to how it works. Also why when I specify a page on the pdf is the printed page inconsistently incorrect (+17 one time, -15 another)? But it seems that my pdf is just terribly formatted since the same mistakes are happening. ¯\_(ツ)_/¯ — Rishi B, Mar 17 '23 at 22:06
All my code is typed. That's why it says "doc: typing". I am telling the python interpreter "I expect the variable doc to have the type Document, or the value None". — Joris Schellekens, Mar 18 '23 at 08:17
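
For readers unfamiliar with the colon syntax discussed in these comments, a small standalone illustration follows. The Listener class and the describe function are hypothetical stand-ins invented for this example; only the typing module is real.

```python
import typing


class Listener:
    """Hypothetical stand-in for a class such as SimpleTextExtraction."""


# "name: Type = value" is an ordinary assignment plus a type hint
doc: typing.Optional[str] = None  # doc may hold a str, or None
l: Listener = Listener()          # same effect as: l = Listener()


def describe(value: typing.Optional[str]) -> str:
    """Hints work the same way on parameters and return values."""
    return "nothing yet" if value is None else f"got {value}"


# the interpreter does not enforce hints; tools such as mypy check them
unchecked: int = "not an int"

print(describe(doc), type(l).__name__, unchecked)
```

In other words, the annotations document intent and help editors and type checkers, while the program behaves the same with or without them.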
