I am working on a script to extract plain text from the lecture PDFs my professor supplies for an online class. Ultimately I would like to feed the text into a text-to-speech engine so I have an audio file to listen to on the go, but I can't get any of the Python PDF modules to produce the desired plain-text equivalent.
import pdftotext

# Open the PDF in binary mode; the with-block closes the file afterwards
with open("Chapter3Totalcopy.pdf", "rb") as pdf_object:
    pdf = pdftotext.PDF(pdf_object)

# Iterate over all the pages
for page in pdf:
    print(page)
    break  # Added break to just show first page instead of the 96 page PDF
This is the text from the PDF.
We all belong to organizations of some sort. Whether its work, military, scouts, soccer league, book club, or some sports team. And all organizations have certain characteristics: routines and business process, politics, culture, reciprocal relationship with environments, and structure. This chapter begins by dissecting an organization from both a technical and behavioral point of view. The technical definition focuses on three elements: capital and labor; inputs from the environment; and outputs to the environment. The behavioral view emphasizes group relationships, values, and structures. These two definitions are not contradictory. The technical definition focuses on thousands of firms in competitive markets whereas the behavioral definition focuses on individual firms and an gaiai ie kigs.
The issue comes up in the last sentence, which should end, "focuses on individual firms and an organization's inner workings." As you can see above, even the text copied directly from the PDF comes out garbled. Below is the output for the same passage when the PDF is run through the script.
We all belong to organizations of some sort. Whether its work, military, scouts, soccer league, book club, or some sports team. And all organizations have certain characteristics: routines and business process, politics, culture, reciprocal relationship with environments, and structure. This chapter begins by dissecting an organization from both a technical and behavioral point of view. The technical definition focuses on three elements: capital and labor; inputs from the environment; and outputs to the environment. The behavioral view emphasizes group relationships, values, and structures. These two definitions are not contradictory. The technical definition focuses on thousands of firms in competitive markets whereas the behavioral definition focuses on individual firms and an ga i a i i e ki gs.
This seems to happen any time there is an apostrophe, and the garbling does not seem to follow a consistent pattern. I have tried several different PDF modules (pdfminer, PyPDF2, etc.) and they all produce the same result.
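To see how widespread the problem is across all 96 pages, something like the following could scan the extracted text for the kind of fragmented run shown in the script output. This is only a rough sketch: `find_fragment_runs` is a hypothetical helper of my own, not part of pdftotext or any other module, and the thresholds (six or more consecutive tokens of at most two letters) are guesses based on the one example above. It would not catch the run-together form that copying from the PDF produces, and it could occasionally flag ordinary English.

```python
import re


def find_fragment_runs(text, min_tokens=6, max_len=2):
    """Return (offset, run) pairs for runs of very short alphabetic tokens.

    Hypothetical diagnostic: a long run of one- or two-letter "words"
    such as "ga i a i i e ki gs" is unlikely in normal prose, so it is
    treated as a sign of garbled extraction. Both thresholds are
    assumptions and may need tuning for a given document.
    """
    short_token = r"[A-Za-z]{1,%d}" % max_len
    # (min_tokens - 1) short tokens each followed by whitespace, then one more
    pattern = re.compile(
        r"\b(?:%s\s+){%d,}%s\b" % (short_token, min_tokens - 1, short_token)
    )
    return [(m.start(), m.group(0)) for m in pattern.finditer(text)]


if __name__ == "__main__":
    sample = "focuses on individual firms and an ga i a i i e ki gs."
    for offset, run in find_fragment_runs(sample):
        print(offset, repr(run))
```

Running it on each page's text would at least show which pages and passages are affected, which might help narrow down whether the apostrophe is really the trigger.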
Edit: I also tried running it through calibre and got the same result.
Any help would be appreciated.