I want to convert a large number of PDFs into text files. The formatting is very important, and only Adobe Reader seems to get it right (PDFMiner and PyPDF2 do not).

Is there a way to automate the "export as text" function from Adobe Reader?

martineau
Wessowang
  • I doubt it. AFAIK [there's no command line argument](https://stackoverflow.com/questions/619158/adobe-reader-command-line-reference) to do it, which frankly doesn't surprise me because Adobe understandably may not want anyone doing this. You might be able to automate opening the PDF file and choosing whatever "export as text" functionality it has in its GUI. – martineau Nov 08 '19 at 23:38

1 Answer

The following code does what you want for one file by sending simulated keystrokes to Adobe Reader's menus. To process many files, I recommend organizing the script into a few small functions and calling them in a loop. You'll need to install the `keyboard` library with pip or some other tool.

import pathlib as pl
import os
import keyboard
import time
import io


KILL_KEY = 'esc'  # note: defined here but never used below
read_path  = pl.Path("C:/Users/Sam/Downloads/WS-1401-IP.pdf")
####################################################################


# same folder and file name as the PDF, but with a .txt extension
write_path = read_path.with_suffix(".txt")
overwrite_file = write_path.exists()

# Keystroke sequence sent to Adobe Reader:
# `alt`      -- activate keyboard shortcuts
# `f`        -- open file menu
# `v`        -- select "save as text" option
# (typing)   -- keyboard.write(str(write_path)) types the output path
# `alt+s`    -- save button
# `ctrl+w`   -- close file


os.startfile(read_path)  # open the PDF in its default viewer (Windows only)
time.sleep(1)
keyboard.press_and_release('alt')
time.sleep(1)
keyboard.press_and_release('f') # -- open file menu
time.sleep(1)
keyboard.press_and_release('v') # -- select "save as text" option
time.sleep(1)
keyboard.write(str(write_path))
time.sleep(1)
keyboard.press_and_release('alt+s')
time.sleep(2)
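# When overwriting, record the old modification time before confirming,
# so the wait loop can tell a fresh save apart from the stale file.
old_mtime = os.path.getmtime(write_path) if overwrite_file else None
if overwrite_file:
    keyboard.press_and_release('y')  # confirm the overwrite prompt

# wait up to 5 seconds for the program to finish saving
waited_too_long = True
for _ in range(5):
    time.sleep(1)
    if os.path.exists(write_path) and os.path.getmtime(write_path) != old_mtime:
        waited_too_long = False
        break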

if waited_too_long:
    raise ValueError(
        f"program probably saved to somewhere other than {write_path}"
    )

keyboard.press_and_release('ctrl+w') # close the file
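
The suggestion to wrap the steps in functions and loop over many files could be sketched roughly as follows. The names `output_path_for`, `pending_pdfs`, `export_with_reader`, and `convert_all` are hypothetical. The menu keys and sleep durations are copied from the script above, and the GUI part assumes Windows, an English Adobe Reader as the default PDF viewer, and the third-party `keyboard` package. Files that already have a `.txt` next to them are skipped, so the overwrite prompt never appears.

```python
import os
import time
from pathlib import Path
from typing import Callable, Iterable, List


def output_path_for(pdf_path: Path) -> Path:
    """Return the .txt path next to the PDF (same folder, same stem)."""
    return pdf_path.with_suffix(".txt")


def pending_pdfs(folder: Path) -> List[Path]:
    """List PDFs in `folder` that have no matching .txt yet, sorted by name."""
    pdfs = sorted(p for p in folder.iterdir() if p.suffix.lower() == ".pdf")
    return [p for p in pdfs if not output_path_for(p).exists()]


def export_with_reader(pdf_path: Path, timeout_s: int = 5) -> bool:
    """Drive the viewer's GUI for one file; return True if the .txt appeared.

    Assumes Windows and the English menu keys used in the answer.
    """
    import keyboard  # third-party; imported lazily so the helpers work without it

    out_path = output_path_for(pdf_path)
    os.startfile(pdf_path)  # Windows only
    time.sleep(1)
    for key in ("alt", "f", "v"):  # shortcuts, file menu, "save as text"
        keyboard.press_and_release(key)
        time.sleep(1)
    keyboard.write(str(out_path))
    time.sleep(1)
    keyboard.press_and_release("alt+s")  # save button

    # Poll once per second until the output file shows up or we give up.
    for _ in range(timeout_s):
        time.sleep(1)
        if out_path.exists():
            keyboard.press_and_release("ctrl+w")  # close the file
            return True
    return False


def convert_all(pdfs: Iterable[Path],
                export_one: Callable[[Path], bool]) -> List[Path]:
    """Run `export_one` on each PDF and return the ones that failed."""
    return [pdf for pdf in pdfs if not export_one(pdf)]
```

A call such as `failed = convert_all(pending_pdfs(pl.Path("C:/Users/Sam/Downloads")), export_with_reader)` would then process a whole folder. Passing the export step in as a function keeps the file bookkeeping testable without a GUI. Keyboard focus has to stay on the Reader window while it runs.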
Toothpick Anemone
  • Thank you. I just needed to replace "f" and "v" because my Adobe Reader isn't in English, and I added **`keyboard.press_and_release('ctrl+q')`** to close Adobe afterwards. – Wessowang Nov 09 '19 at 11:49
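
One way to handle the localization issue from the comment above is to keep the menu letters in one place instead of hard-coding them. The following is only a sketch: `MenuKeys`, `build_save_steps`, `closing_steps`, and `run_steps` are hypothetical names, the defaults are the English keys from the answer, and the correct letters for any other language have to be read off the localized menus. The `alt+s` shortcut for the Save button may be localized too; the comment only mentions `f` and `v`.

```python
from dataclasses import dataclass
from typing import List, Tuple

Step = Tuple[str, str]  # (action, value), where action is "press" or "write"


@dataclass(frozen=True)
class MenuKeys:
    """Menu accelerator keys; the defaults are the English ones from the answer."""
    file_menu: str = "f"
    save_as_text: str = "v"


def build_save_steps(out_path: str, keys: MenuKeys = MenuKeys()) -> List[Step]:
    """Return the keystrokes that export the open PDF, without sending them."""
    return [
        ("press", "alt"),              # activate keyboard shortcuts
        ("press", keys.file_menu),     # open the file menu
        ("press", keys.save_as_text),  # choose "save as text"
        ("write", out_path),           # type the output path
        ("press", "alt+s"),            # save button (assumed not localized)
    ]


def closing_steps(quit_reader: bool = False) -> List[Step]:
    """Steps to send once the output file exists: close the file, optionally quit."""
    steps: List[Step] = [("press", "ctrl+w")]
    if quit_reader:
        steps.append(("press", "ctrl+q"))  # the extra key from the comment
    return steps


def run_steps(steps: List[Step], delay_s: float = 1.0) -> None:
    """Send the steps with the third-party `keyboard` package, pausing between them."""
    import time
    import keyboard  # imported lazily so the pure helpers work without it

    for action, value in steps:
        if action == "press":
            keyboard.press_and_release(value)
        else:
            keyboard.write(value)
        time.sleep(delay_s)
```

Building the steps as plain data makes it possible to print and check the sequence for a given language before any keystroke is sent. The closing steps are kept separate so they can be sent only after the saved file has been confirmed to exist.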