I recreated the PDF reader program that is made in this video:

https://www.youtube.com/watch?v=itRLRfuL_PQ&t=447s

I tested it with some random PDF files I have on my PC, but the program extracts text from only a few of them. Why is that? Is it a flaw in the program, am I missing something, or are there certain PDF files that cannot be read by default?

Here is the full code:

import tkinter as tk
import PyPDF2
from PIL import ImageTk,Image
from tkinter.filedialog import askopenfile

root = tk.Tk()

root.geometry('+%d+%d'%(975,150))

canvas = tk.Canvas(root, width=600, height=300)
canvas.grid(columnspan=3, rowspan=3)

logo = Image.open("tkinterResources/logo.png")
logo = ImageTk.PhotoImage(logo)
logo_label = tk.Label(image=logo)
logo_label.image = logo

logo_label.grid(row=0, column=1)

instructions = tk.Label(root, text="Select a PDF file on your computer to extract all its text", font="Raleway")
instructions.grid(row=1, column=0, columnspan=3)

def open_file():
    browse_text.set("loading...")
    # "filetypes" is the documented name of this file dialog option
    file = askopenfile(parent=root, mode="rb", title="Choose a file", filetypes=[("Pdf file", "*.pdf")])
    if file:
        read_pdf = PyPDF2.PdfFileReader(file)
        # Only the first page is read; the result may be empty if the page
        # has no extractable text layer
        page = read_pdf.getPage(0)
        page_content = page.extractText()

        text_box = tk.Text(root, height=10, width=50, padx=15, pady=15)
        text_box.insert(1.0, page_content)
        text_box.tag_configure("center", justify="center")
        text_box.tag_add("center", 1.0, "end")
        text_box.grid(row=3, column=1)

    # Reset the button label even when the dialog is cancelled
    browse_text.set("Browse")

browse_text = tk.StringVar()
browse_btn = tk.Button(root, textvariable = browse_text, 
font="Raleway", bg="#20bebe", fg="white", height=2, width=15, 
command=open_file)
browse_text.set("Browse")
browse_btn.grid(row=2, column=1)

canvas = tk.Canvas(root, width=600, height=250)
canvas.grid(columnspan=3)

root.mainloop()

I also created my own PDF file with just one line of text using OpenOffice, and even that doesn't work.

paulsm4
Naky
  • 1) Can you tell us exactly *HOW* your Python/TKinter app is "reading" .pdfs? What library are you using? What API calls are you making? Please show some code. 2) Have you considered that maybe you can't "read text from the .pdf" ... because the .pdfs in question have bitmapped images (vs. textual content)? PS: Thank you for editing your post, and updating: 1) you're using PyPDF2, 2) showing sample code :) – paulsm4 Oct 20 '21 at 20:39
  • @paulsm4 Sorry, I just forgot to paste the code at first, but I've done it now. The PDFs in question, if I understand correctly, should not include any images, only lines of text. I don't know whether there are hidden bitmapped images that define the page layout; if so, I wonder how exactly that works. – Naky Oct 20 '21 at 20:46
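
One way to check the image-only hypothesis raised in the comments is to look at every page instead of only page 0 and see which pages yield no text at all. The following is a minimal sketch, assuming the same pre-3.0 PyPDF2 API used in the question (`PdfFileReader`, `getNumPages`, `getPage`, `extractText`); the function names `pages_without_text` and `extract_page_texts` are hypothetical helpers, not part of PyPDF2.

```python
def pages_without_text(page_texts):
    """Return the 0-based indices of pages whose extracted text is empty.

    A page counts as empty when its text is None, "", or only whitespace.
    """
    return [i for i, text in enumerate(page_texts) if not (text or "").strip()]


def extract_page_texts(path):
    """Extract the text of every page of the PDF at `path` (hypothetical helper).

    Assumes the pre-3.0 PyPDF2 API from the question. PyPDF2 is imported
    here so that pages_without_text() can be used without it installed.
    """
    import PyPDF2

    with open(path, "rb") as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        return [
            reader.getPage(i).extractText()
            for i in range(reader.getNumPages())
        ]
```

Calling `pages_without_text(extract_page_texts("test.pdf"))` would list the pages for which nothing came back. If every page is listed, the file probably either has no text layer or stores its text in a way this extractor does not handle; the sketch cannot tell those two cases apart.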

1 Answer

I would recommend using textract, the PyPI package, which handles all of this much better. Please see my answer here for a very simple example. It works as claimed: "Extract text from any document. No muss. No fuss."

Example for completeness.

Make sure to install Tesseract OCR as well if you want to use the tesseract method in the example.


import textract

# process() returns bytes, so decode before printing or writing
text = textract.process("test.pdf", method='tesseract').decode('utf-8')
print(text)
with open('textract-results.txt', 'w', encoding='utf-8') as f:
    f.write(text)
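
Since a single extraction method can fail or return nothing for some files, one option is to try several methods in order and keep the first non-empty result. This is only a sketch under stated assumptions: `first_nonempty` and `textract_extractors` are hypothetical helper names, and the method list (`pdftotext`, `pdfminer`, `tesseract`) assumes those textract PDF methods and their external dependencies are available on the machine.

```python
def first_nonempty(path, extractors):
    """Try each (name, extract) pair in order; return (name, text) for the
    first one that yields non-blank text, or (None, "") if none does."""
    for name, extract in extractors:
        try:
            raw = extract(path)
        except Exception:
            # This method failed (e.g. a missing external tool); try the next
            continue
        # Extractors may return bytes (as textract does) or str
        text = raw.decode("utf-8", errors="replace") if isinstance(raw, bytes) else raw
        if text and text.strip():
            return name, text
    return None, ""


def textract_extractors(methods=("pdftotext", "pdfminer", "tesseract")):
    """Build extractors backed by textract (hypothetical helper).

    textract is imported here so first_nonempty() works without it installed.
    """
    import textract

    # m=m binds the current method name to each lambda
    return [(m, lambda p, m=m: textract.process(p, method=m)) for m in methods]
```

For example, `first_nonempty("test.pdf", textract_extractors())` returns the name of the method that produced text along with the text itself. Putting the OCR-based method last is a design choice, since OCR is typically the slowest option.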
james-see
  • Do I also need to install Poppler to run this? I'm getting the "failed at code 127" error and one of the solutions I found is [this](https://stackoverflow.com/questions/63357517/textract-failed-with-exit-code-127-windows-10-pdftotext) – Naky Oct 20 '21 at 21:13
  • @Naky I just confirmed fresh install via `pip3 install textract` on python 3.9 that it works fine without anything extra installed. – james-see Oct 20 '21 at 21:25
  • @KJ where are you seeing that and what is wrong with mit license? I see that poppler is definitely perfectly fine to integrate at a low level: https://poppler.freedesktop.org/ – james-see Oct 20 '21 at 21:30
  • Haha, I came to textract after trying every other solution out there and finding that they failed on certain types of PDFs, including wkhtml2pdf, pypdf2, etc. So the extra dependencies in the install guide were very much necessary, and they paid off when textract just worked for everything I threw at it. I'm talking about thousands of PDFs from many sources. – james-see Oct 21 '21 at 02:17
  • @jamescampbell I'm still trying to make it work, but without Poppler I keep getting the code 127 error. I would like the program to work even on machines without it. Thanks for the help anyway, I guess. – Naky Oct 21 '21 at 12:13
  • @Naky ? I don't understand. The textract docs tell you how to install it and its dependencies. It is very simple to follow the instructions and install. I would start another question and show your code and the error you are getting with textract, as I do not believe it has anything to do with Poppler. – james-see Oct 21 '21 at 14:45
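
Exit code 127 conventionally means "command not found", so a quick way to narrow down the error discussed in these comments is to check whether the external programs are on the PATH at all. The sketch below uses only the standard library; `missing_tools` is a hypothetical helper name, and the default tool names (`pdftotext`, which ships with Poppler, and `tesseract`) are assumptions about what the chosen extraction method calls.

```python
import shutil


def missing_tools(tools=("pdftotext", "tesseract")):
    """Return the names from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All external tools were found.")
```

An empty result only shows that the executables can be located; it does not prove that they run correctly.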