
I wasn't sure how to post a "question" that I had already found an answer to, but I thought it was worth sharing my solution to save others the time I spent figuring this out.

Essentially, I have a PDF (with lots of pages and formatting) that I want to strip the text out of and paste into something else. However, a simple copy/paste keeps the text in its columns and inserts a line break at the end of every line, so for each one you have to press End, Delete, Space, and then repeat that sequence indefinitely. Well, that's what programming was made for: doing repeated tasks for you so you don't have to.

My answer is posted below. If anyone has a better solution please let me know!

RisaAudr

1 Answer


Below is the VBScript I wrote to do this. After running it, you will still need to go back through the text file and fix the bits and pieces that didn't follow the standard pattern the script is programmed for.

Also, I'll note that I used Notepad++ to check how Adobe Reader (on Windows) writes carriage returns versus line feeds, since the distinction is rather blurred today. I referenced this article and the answer by AAT, which helped me understand the difference. The accepted answer is useful when specifically working with VBScript.
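If you would rather check the line endings from a script than in Notepad++, something like the following could work. It is only a rough sketch: the helper name CountOccurrences is made up for illustration, and it assumes the exported file is called originalTextFile.txt and sits in the current folder.

```vbscript
Const ForReading = 1

' Hypothetical helper: counts non-overlapping occurrences of token in text
' by measuring how much shorter the text gets when the token is removed.
Function CountOccurrences(text, token)
    CountOccurrences = (Len(text) - Len(Replace(text, token, ""))) \ Len(token)
End Function

Dim fs, txt, contents, crlfCount, loneCr, loneLf
Set fs = CreateObject("Scripting.FileSystemObject")

If fs.FileExists("originalTextFile.txt") Then
    Set txt = fs.OpenTextFile("originalTextFile.txt", ForReading)
    contents = txt.ReadAll
    txt.Close

    ' Every CR+LF pair also contains one CR and one LF, so subtract the pairs
    ' to get the carriage returns and line feeds that stand alone.
    crlfCount = CountOccurrences(contents, vbCrLf)
    loneCr = CountOccurrences(contents, vbCr) - crlfCount
    loneLf = CountOccurrences(contents, vbLf) - crlfCount

    WScript.Echo "CR+LF: " & crlfCount & ", lone CR: " & loneCr & ", lone LF: " & loneLf
Else
    WScript.Echo "originalTextFile.txt not found in the current folder"
End If
```

If the lone CR or lone LF counts are not zero, the Replace calls on vbCrLf will miss those breaks, and you would need to handle vbCr or vbLf as well.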

REM Set constants, then open file and copy into a buffer (contents)
Const ForReading = 1, ForWriting = 2
Dim fs, txt, contents

Set fs = CreateObject("Scripting.FileSystemObject")
Set txt = fs.OpenTextFile("originalTextFile.txt", ForReading)
contents = txt.ReadAll
txt.Close

REM Replace each double line break (CR+LF twice) with placeholder text that does not occur in the document
contents = Replace(contents, vbCrLf & vbCrLf, "$%^&")

REM Then remove the leftover single line breaks (use " " instead of "" if the export leaves no trailing space at line ends)
contents = Replace(contents, vbCrLf, "")

REM Finally, restore the original double line breaks for paragraph spacing
contents = Replace(contents, "$%^&", vbCrLf & vbCrLf)

REM Write to file
Set txt = fs.OpenTextFile("textFileRemovedSpaces.txt", ForWriting)
txt.Write contents
txt.Close

MsgBox "Done!"
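To see what the three Replace steps do without touching any files, here is a small sketch that applies them to a string in memory. The function name UnwrapLines and its joiner parameter are my own additions for illustration, as is the check that the placeholder does not already occur in the text (the script above simply assumes it doesn't).

```vbscript
' Hypothetical helper: joins single line breaks with joiner while keeping
' double line breaks (paragraph gaps) intact.
Function UnwrapLines(text, joiner)
    Const PLACEHOLDER = "$%^&"
    Dim result

    ' The technique only works if the placeholder is not already in the text.
    If InStr(text, PLACEHOLDER) > 0 Then
        Err.Raise vbObjectError + 1, "UnwrapLines", "Placeholder already occurs in the text"
    End If

    result = Replace(text, vbCrLf & vbCrLf, PLACEHOLDER)    ' protect paragraph breaks
    result = Replace(result, vbCrLf, joiner)                ' join wrapped lines
    result = Replace(result, PLACEHOLDER, vbCrLf & vbCrLf)  ' restore paragraph breaks
    UnwrapLines = result
End Function

Dim sample
sample = "First line of" & vbCrLf & "paragraph one." & vbCrLf & vbCrLf & "Paragraph two."

' Echoes "First line of paragraph one.", a blank line, then "Paragraph two."
WScript.Echo UnwrapLines(sample, " ")
```

Passing "" as the joiner reproduces what the script above does; passing " " is probably what you want when the exported lines do not end in a space.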

Step 1: Save the PDF as a text file; this strips out the pictures, etc. In Adobe Reader, use File -> Save as other -> Text.

Step 2: Save the script above as Something.vbs and edit the file names in it as appropriate. Make sure to also create the empty text file that the script saves the edited text to. Note that in VBScript, the keyword "REM" marks the rest of the line as a comment.

Step 3: Run the script.

Step 4: Profit!
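If editing the file names inside the script each time gets tedious, one possible variation is to pass them on the command line. This is only a sketch under a few assumptions: the Sub name ConvertFile is invented here, the script is run with cscript, and the output file is created by passing True as the third argument to OpenTextFile instead of being created by hand.

```vbscript
Const ForReading = 1, ForWriting = 2

' Hypothetical wrapper around the same three Replace steps.
Sub ConvertFile(inputPath, outputPath)
    Dim fs, txt, contents
    Set fs = CreateObject("Scripting.FileSystemObject")

    Set txt = fs.OpenTextFile(inputPath, ForReading)
    contents = txt.ReadAll
    txt.Close

    contents = Replace(contents, vbCrLf & vbCrLf, "$%^&")
    contents = Replace(contents, vbCrLf, "")
    contents = Replace(contents, "$%^&", vbCrLf & vbCrLf)

    ' The third argument (True) creates the output file if it does not exist.
    Set txt = fs.OpenTextFile(outputPath, ForWriting, True)
    txt.Write contents
    txt.Close
End Sub

If WScript.Arguments.Count = 2 Then
    ConvertFile WScript.Arguments(0), WScript.Arguments(1)
    WScript.Echo "Done!"
Else
    WScript.Echo "Usage: cscript //nologo Something.vbs input.txt output.txt"
End If
```

From a command prompt in the folder that holds the files, it would be run as: cscript //nologo Something.vbs originalTextFile.txt textFileRemovedSpaces.txt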

I've found this useful, as for the most part it saved a lot of effort in editing a 300-page PDF that I needed to convert to a Word document.

Again, if anyone has a better solution please let me know!

Community
RisaAudr