
I'm looking for a way to open a PDF in Chrome, select all, and copy the contents so I can write them to a text file. I understand this is a very hacky approach, but I've already tried the pdftotext and textract libraries for reading PDF text, and manually doing select all and copy/paste in Chrome has read the text in my files most consistently.

This is what I have so far:

import os
import subprocess
import time

# pdf_f: path to the PDF file (defined elsewhere)

# open file in chrome
p = subprocess.Popen(['open', '-na', 'Google Chrome', '--args', '--new-window', f'{pdf_f}'])
time.sleep(1)
# select all
cmd = """osascript -e 'tell application "System Events" to keystroke "a" using {command down}'"""
os.system(cmd)
time.sleep(1)
# copy
cmd = """osascript -e 'tell application "System Events" to keystroke "c" using {command down}'"""
os.system(cmd)

This visibly appears to work: the PDF opens in Chrome and all of the text shows as selected, but the text isn't being copied. I can't tell whether the problem is the copy command itself, or whether, when the new Chrome window opens, the focus is on the window rather than on the PDF inside it.
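One way to probe the focus hypothesis is to bring Chrome to the front explicitly before each keystroke. The sketch below only builds the `osascript` argument list; `build_keystroke_command` is a hypothetical helper name, the half-second `delay` is an arbitrary guess, and activating the application may still leave focus outside the PDF viewer, so treat it as a diagnostic rather than a fix.

```python
import shlex


def build_keystroke_command(key, app="Google Chrome"):
    """Build an osascript argv that activates `app`, then sends Cmd+<key>.

    Hypothetical helper: the app name and the 0.5 s delay are assumptions.
    """
    return [
        "osascript",
        # Bring the target application to the front first.
        "-e", f'tell application "{app}" to activate',
        # Give the window a moment to take focus.
        "-e", "delay 0.5",
        # Send the keystroke through System Events, as in the question.
        "-e", f'tell application "System Events" to keystroke "{key}" using {{command down}}',
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so this is safe anywhere.
    for key in ("a", "c"):
        print(shlex.join(build_keystroke_command(key)))
```

On macOS the returned list could be passed to `subprocess.run(...)` in place of the `os.system(cmd)` calls above.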

martineau
PL3
  • The extra hop of copying through Chrome doesn't seem very efficient. Have you evaluated other Python PDF libraries such as `PyPDF2` and its `PdfFileReader` class? https://pypi.org/project/PyPDF2/#description. Other helpful answers may also be here: https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file – user9074332 Jan 20 '19 at 21:38
  • Yeah, I tried those too, but unfortunately they weren't reading the text consistently with my files. I tried opening the files with a few different apps, and Chrome copied the text in the way that was best for me to parse later with regex, so I decided to go that route. – PL3 Jan 21 '19 at 02:34

1 Answer


Found a way:

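import subprocess
import time

import pyautogui

# path (a pathlib.Path), fnms (PDF file names) and clipboard_to_txt are defined elsewhere
screen_width, screen_height = pyautogui.size()

for fnm in fnms:
    pdf_f = path/'data'/'pdfs'/f'{fnm}'
    # open file in chrome
    p = subprocess.Popen(['open', '-na', 'Google Chrome', f'{pdf_f}'])
    time.sleep(1)
    # click in the middle of the screen to give the pdf focus
    pyautogui.moveTo(screen_width//2, screen_height//2)
    pyautogui.click()
    # select all
    pyautogui.hotkey('command', 'a')
    # copy
    pyautogui.hotkey('command', 'c')
    # write txt file
    clipboard_to_txt(path/'data'/'txts'/(fnm[:-3]+'txt'))
    # close tab
    pyautogui.hotkey('command', 'w')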
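
The `clipboard_to_txt` helper is not shown in the answer. A minimal sketch of what it might look like follows, assuming macOS, where the `pbpaste` command prints the clipboard contents. The `cmd` parameter and the returned character count are additions for illustration, not part of the original code.

```python
import subprocess
from pathlib import Path


def clipboard_to_txt(out_path, cmd=("pbpaste",)):
    """Write the current clipboard text to out_path; return the character count.

    Assumption: `cmd` is a command that prints the clipboard to stdout
    (pbpaste on macOS). It is a parameter so another tool can be swapped in.
    """
    # Run the paste command and capture the raw bytes it prints.
    result = subprocess.run(list(cmd), capture_output=True, check=True)
    text = result.stdout.decode("utf-8", errors="replace")

    # Create the output folder if needed, then write the text file.
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(text, encoding="utf-8")
    return len(text)
```

Because the copy happens asynchronously, a short `time.sleep` between the Cmd+C hotkey and this call may be needed if the written files come out empty or stale.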
PL3