Issues with extracting plain text from pdf

Question

I am writing a script to extract plain text from the PDFs my professor supplies for lectures in an online class. Ultimately I would like to feed the text into a text-to-speech engine so I have an audio file to listen to on the go, but I can't get any of the Python PDF modules to produce the desired plain-text equivalent.

import pdftotext

# Open the PDF in binary mode; the with block closes the file afterwards
with open("Chapter3Totalcopy.pdf", "rb") as pdf_object:
    pdf = pdftotext.PDF(pdf_object)

# Iterate over all the pages
for page in pdf:
    print(page)
    break  # Added break to show only the first page of the 96-page PDF

This is the text copied by hand from the PDF.

We all belong to organizations of some sort. Whether its work, military, scouts, soccer league, book club, or some sports team. And all organizations have certain characteristics: routines and business process, politics, culture, reciprocal relationship with environments, and structure. This chapter begins by dissecting an organization from both a technical and behavioral point of view. The technical definition focuses on three elements: capital and labor; inputs from the environment; and outputs to the environment. The behavioral view emphasizes group relationships, values, and structures. These two definitions are not contradictory. The technical definition focuses on thousands of firms in competitive markets whereas the behavioral definition focuses on individual firms and an gaiai ie kigs.

The issue comes up in the last sentence, which should end, "focuses on individual firms and an organization's inner workings." As you can see above, even text copied directly from the PDF is garbled. Below is the output for the same passage when the PDF is run through the script.

We all belong to organizations of some sort. Whether its work, military, scouts, soccer league, book club, or some sports team. And all organizations have certain characteristics: routines and business process, politics, culture, reciprocal relationship with environments, and structure. This chapter begins by dissecting an organization from both a technical and behavioral point of view. The technical definition focuses on three elements: capital and labor; inputs from the environment; and outputs to the environment. The behavioral view emphasizes group relationships, values, and structures. These two definitions are not contradictory. The technical definition focuses on thousands of firms in competitive markets whereas the behavioral definition focuses on individual firms and an ga i a i i e ki gs.

This seems to happen any time there is an apostrophe, and the garbling does not seem to follow a pattern. I have tried many different PDF modules (pdfminer, PyPDF2, etc.) and they all produce the same result.

Edit: I also tried running it through calibre and got the same result.

Any help would be appreciated.

This looks like an encoding or mapping problem in the PDF. Please share it for analysis. — mkl, Sep 15 '20 at 04:43
You can find the file [here](https://wetransfer.com/downloads/79895ad004ab7e83f19d7d1a58d72e7a20200917025930/77743e); it's a limited download window. — eNgE, Sep 17 '20 at 03:02
Sorry, try this one [here](https://www.dropbox.com/s/qiksngp9cw5evun/Chapter%203%20Total.pdf?dl=0) — eNgE, Sep 17 '20 at 20:42
The lines with issues use a different PDF-internal font object than the lines without issues. Both font objects represent subsets of the same base font, so the visual representation is seamless, but those font objects differ in the amount of extra data they bring along. In particular, the font object used in the lines with issues only has an incomplete **ToUnicode** mapping, and this mapping is pivotal for the process of text extraction: For each character code without an entry in that map you get a '' during text extraction. Thus, this is an issue of the PDF, not of the text extractors. — mkl, Sep 18 '20 at 08:24
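To illustrate the comment above, here is a minimal sketch of how an extractor depends on that map. It is a simplified model, not how pdftotext or any other library is actually implemented: the CMap fragment is invented sample data, only `bfchar` entries are handled (real maps can also contain `bfrange` sections), and the names `parse_bfchar` and `extract` are made up for this example.

```python
import re

# Invented, heavily simplified ToUnicode CMap fragment (hypothetical data).
# Each bfchar line maps a character code to a UTF-16BE encoded Unicode value.
SAMPLE_CMAP = """
3 beginbfchar
<0001> <0061>
<0002> <006E>
<0003> <0020>
endbfchar
"""

BFCHAR_SECTION = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text):
    """Return {character code: Unicode string} for all bfchar entries."""
    mapping = {}
    for section in BFCHAR_SECTION.findall(cmap_text):
        for code_hex, uni_hex in BFCHAR_ENTRY.findall(section):
            mapping[int(code_hex, 16)] = bytes.fromhex(uni_hex).decode("utf-16-be")
    return mapping


def extract(codes, mapping, missing="\ufffd"):
    """Decode character codes; codes without a map entry become `missing`."""
    text = "".join(mapping.get(code, missing) for code in codes)
    unmapped = sorted({code for code in codes if code not in mapping})
    return text, unmapped


mapping = parse_bfchar(SAMPLE_CMAP)
# Codes 4 and 5 are drawn on the page but have no entry in the map.
text, unmapped = extract([1, 2, 3, 4, 5, 1], mapping)
print(repr(text), unmapped)
```

The glyphs for the unmapped codes still render correctly because drawing uses the font program, not the ToUnicode map, which is why the page looks fine while every extractor returns the same gaps.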
One can try to manually repair these incomplete maps. I don't know which python tools can be used for such a repair, [this answer](https://stackoverflow.com/a/39644941/1729265) illustrates the repair using PDFBox and Java. — mkl, Sep 18 '20 at 08:27
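As a rough idea of what such a manual repair involves, the sketch below only generates the text of an extra `bfchar` section for codes that are missing from a map. It assumes someone has already worked out by eye which character each missing code stands for; the codes and characters used here are hypothetical, and `format_bfchar` is a name made up for this example. Writing the result back into the font's ToUnicode stream needs a PDF library and is not shown.

```python
def format_bfchar(entries, code_bytes=2):
    """Render {character code: Unicode string} as a CMap bfchar section.

    code_bytes is the width of a character code in bytes (assumed to be 2
    here). A single bfchar section may hold at most 100 entries, so larger
    repairs would have to be split into several sections.
    """
    if len(entries) > 100:
        raise ValueError("a bfchar section may hold at most 100 entries")
    lines = [f"{len(entries)} beginbfchar"]
    for code, text in sorted(entries.items()):
        code_hex = format(code, f"0{code_bytes * 2}X")
        uni_hex = text.encode("utf-16-be").hex().upper()
        lines.append(f"<{code_hex}> <{uni_hex}>")
    lines.append("endbfchar")
    return "\n".join(lines)


# Hypothetical repair: codes 4 and 5 were identified visually as "o" and "r".
missing_entries = {4: "o", 5: "r"}
print(format_bfchar(missing_entries))
```

For a 96-page document with several affected fonts this is tedious work, so it is only worthwhile when the number of missing codes is small.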
By the way, the PDF has some internal errors, too: the PDF object cross-reference tables contain multiple entries pointing to offset 0 in the file. This cannot be correct. — mkl, Sep 18 '20 at 08:32
Thank you, your PDF knowledge and insights are appreciated. I figured it was something wrong with the PDF when none of the Python modules worked. The mapping makes sense since it seems to happen on apostrophes. So at this point the only clear route to automating extracting the text would be OCR which is something that I had been looking at already. That and getting the professor to give me the originals before they were converted to PDF. — eNgE, Sep 18 '20 at 15:02
