
I need to extract text from a PDF. I tried PyPDF2, but its extractText method returned what looks like encrypted text, even though the PDF is not encrypted according to the isEncrypted attribute.

So I moved on to looking for a program that does the job from the command prompt, so that I could call it from Python with the subprocess module. I found a program called textextract, which did what I wanted with the following command line in cmd:

"textextract.exe" "download.pdf" /to "download.txt"

However, when I tried running it with subprocess I couldn't get a 0 return code.

Here is the code I tried:

import shlex
import subprocess
textextract = shlex.split(r'"textextract.exe" "download.pdf" /to "download.txt"')
subprocess.run(textextract)

I already tried it with shell=True, but it didn't work. Can anyone help me?

martineau
  • What's the `/to` you're using do or mean? – martineau Oct 09 '18 at 23:45
  • the /to is part of the syntax of the program. It tells the textextract to convert the pdf file to the txt. – Renato da Silva Oct 10 '18 at 00:17
  • I tried running cmd.exe via subprocess but it got into an endless loop (don't know why) and didn't work... – Renato da Silva Oct 10 '18 at 00:21
  • If `/to` is an argument to the process being started, it should have quotes around it when passed to `subprocess` as one of the arguments. Similar to what's in my answer to the question [cmd to run exe not working from Python](https://stackoverflow.com/questions/32150690/cmd-to-run-exe-not-working-from-python) which shows an example of passing them. – martineau Oct 10 '18 at 01:50
  • Here's [another example](https://stackoverflow.com/a/15207409/355230). – martineau Oct 10 '18 at 01:55
  • shlex already handles the quoting. I tried changing it anyway and still couldn't get the expected result. – Renato da Silva Oct 10 '18 at 13:19
  • I tried changing the quotes to double quoting '" "' and now I get PermissionError: [WinError 5] Access Denied – Renato da Silva Oct 10 '18 at 13:56
  • I suggest you try `shlex.split('"textextract.exe" "download.pdf" "/to" "download.txt"', posix=False)`. – martineau Oct 10 '18 at 14:12
  • still getting PermissionError: [WinError 5] Access Denied – Renato da Silva Oct 11 '18 at 12:00
  • I'm thinking maybe it's the cmd security permissions...tried to change it, but windows won't let me... – Renato da Silva Oct 11 '18 at 12:36
  • Where did you get the `textextract.exe` utility? I may be able to help you if I can obtain a copy for testing purposes. – martineau Oct 11 '18 at 17:51
  • https://download.cnet.com/PDF-to-Text/3001-18497_4-75415960.html – Renato da Silva Oct 12 '18 at 00:44
  • Renato: That link is to download something named `pdftotext.exe`, not `textextract.exe`—so I don't understand why you posted it. – martineau Oct 12 '18 at 17:39
  • The installation exe file has a different name than the installed running exe file, which is textextract.exe. Thanks for your help! – Renato da Silva Oct 12 '18 at 20:56
  • All I ended up with was a GUI program named `pdftotext.exe` after running the downloaded installer program, so I think you are mistaken. – martineau Oct 12 '18 at 21:01
  • My bad. I tried several and guess I confused the sites. I think it's this one: https://baixar.freedownloadmanager.org/Windows-PC/PDF2Text-Pilot/GRATUITO-3.0.1.html – Renato da Silva Oct 12 '18 at 23:02
  • Renato: OK, I'll download that and take another look—however it will likely be a while before I get to it... – martineau Oct 12 '18 at 23:44
  • No worries. Thanks a lot for all your effort. Appreciate it. By the way, I don't know if it's relevant, but I'm using Spyder from the Anaconda distribution. – Renato da Silva Oct 13 '18 at 13:42

2 Answers


I was able to get the following script to work from the command line after installing the PDF2Text Pilot application you're trying to use:

import shlex
import subprocess

args = shlex.split(r'"textextract.exe" "download.pdf" /to "download.txt"')
print('args:', args)
subprocess.run(args)

Sample screen output of running it from a command line session:

> C:\Python3\python run-textextract.py
args: ['textextract.exe', 'download.pdf', '/to', 'download.txt']
Progress:
Text from "download.pdf" has been successfully extracted...
Text extraction has been completed!

The above output was generated using Python 3.7.0.

I don't know whether your use of Spyder on Anaconda affects things, since I'm not familiar with either. If you continue to have problems, I suggest you see if you can get things working directly, i.e. by running the Python interpreter on the script manually from the command line, similar to what's shown above. If that works but running it from Spyder doesn't, then you'll at least know where the problem lies.
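To compare the two environments, it may help to print what each one actually sees before the program is launched. The following is only a sketch: `run_and_report` is a hypothetical helper name, not part of any library, and `capture_output`/`text` assume Python 3.7 or later.

```python
import os
import shutil
import subprocess
import sys


def run_and_report(args):
    """Run a command given as a list and print details useful for debugging.

    Hypothetical helper for illustration only.
    """
    # Show which interpreter and working directory are in effect; an IDE may
    # use different ones than a plain command prompt does.
    print("python:", sys.executable)
    print("cwd:", os.getcwd())

    # shutil.which returns the full path of the executable it finds on the
    # search path, or None if nothing matching is found.
    print("resolved executable:", shutil.which(args[0]))

    # capture_output=True collects stdout and stderr so they can be inspected
    # rather than possibly being lost in an IDE console.
    result = subprocess.run(args, capture_output=True, text=True)
    print("return code:", result.returncode)
    print("stdout:", result.stdout)
    print("stderr:", result.stderr)
    return result
```

Calling `run_and_report(["textextract.exe", "download.pdf", "/to", "download.txt"])` once from a command prompt and once from Spyder should show whether the working directory or the resolved executable differs. The diagnostic lines are printed before the program is started, so they appear even if `subprocess.run` raises an exception.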

martineau

There's no need to build a string of quoted strings and then parse that back out to a list of strings. Just create a list and pass that:

import subprocess
command = ["textextract.exe", "download.pdf", "/to", "download.txt"]
subprocess.run(command)

All that shlex.split is doing is creating a list by removing all of the quotes you had to add when creating the string in the first place. That's an extra step that provides no value over just creating the list yourself.
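A quick check of that equivalence, as a sketch using only the standard library. It also shows what the `posix=False` variant suggested in the comments produces; whether that difference explains the error reported there is not confirmed.

```python
import shlex

command_line = r'"textextract.exe" "download.pdf" /to "download.txt"'
explicit = ["textextract.exe", "download.pdf", "/to", "download.txt"]

# Default (POSIX) mode strips the quotes, giving the same list as writing
# the arguments out by hand.
posix_args = shlex.split(command_line)
print(posix_args == explicit)

# With posix=False the quote characters are kept as part of each token,
# so the arguments are no longer the bare file names.
non_posix_args = shlex.split(command_line, posix=False)
print(non_posix_args)
```

Writing the list by hand avoids having to reason about either quoting mode.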

Bryan Oakley