Below is the VBScript I wrote to do this. After running it, you will still need to go back through the text file and fix the bits and pieces that didn't follow the standard pattern the script assumes.
Also, I used Notepad++ to check how Adobe Reader (on Windows) handles carriage returns versus line feeds, since the distinction is rather blurred today. I referenced this article and the answer by AAT, which helped me understand the difference. The accepted answer is useful when working specifically with VBS.
REM Set constants, then open file and copy into a buffer (contents)
Const ForReading = 1, ForWriting = 2
Dim fs, txt, contents
Set fs = CreateObject("Scripting.FileSystemObject")
Set txt = fs.OpenTextFile("originalTextFile.txt", ForReading)
contents = txt.ReadAll
txt.Close
REM Replace each double carriage return (paragraph break) with a placeholder string that should not occur anywhere in the text
contents = Replace(contents, vbCrLf & vbCrLf, "$%^&")
REM Then remove the leftover single carriage returns (line wraps inside a paragraph)
contents = Replace(contents, vbCrLf, "")
REM finally, restore original carriage returns for paragraph spacing
contents = Replace(contents, "$%^&", vbCrLf & vbCrLf)
REM Write to file
Set txt = fs.OpenTextFile("textFileRemovedSpaces.txt", ForWriting)
txt.Write contents
txt.Close
MsgBox("Done!")
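The script above only matches vbCrLf (a carriage return followed by a line feed). If your text file came from a tool that writes bare line feeds or bare carriage returns, those Replace calls would find nothing to change. Here is a minimal sketch that normalizes the line endings first; NormalizeLineEndings is a hypothetical helper name of my own, not part of the script above:

```vbscript
' Hypothetical helper: convert CR, LF, or CRLF line endings to CRLF
Function NormalizeLineEndings(text)
    Dim result
    result = Replace(text, vbCrLf, vbLf)   ' collapse CRLF to LF first
    result = Replace(result, vbCr, vbLf)   ' then turn any bare CR into LF
    NormalizeLineEndings = Replace(result, vbLf, vbCrLf)   ' emit CRLF everywhere
End Function
```

If you need it, one place to call it would be right after the file is read, e.g. `contents = NormalizeLineEndings(contents)`, so that the later replacements see consistent line endings.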
Step 1: Save the PDF as a text file; this strips out the pictures etc. In Adobe Reader, use File -> Save as Other -> Text.
Step 2: Save the script above as Something.vbs, and edit the file names in the script as appropriate. Make sure to also create the empty text file that the script will save the edited text to. Note that in VBS, the text "REM" signifies that a comment follows.
Step 3: Run Script.
Step 4: Profit!
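If you would rather not edit file names inside the script or create the empty output file by hand, the steps above could be sketched as a command-line variant. This is only a sketch under a few assumptions: the script name Unwrap.vbs and the two-argument usage are made up for illustration, and it relies on the optional third argument of OpenTextFile (create), which makes the FileSystemObject create the output file when it does not exist.

```vbscript
' Sketch: same unwrapping logic, with file names taken from the command line.
' Assumed usage: cscript //nologo Unwrap.vbs input.txt output.txt
Option Explicit
Const ForReading = 1, ForWriting = 2
Const PLACEHOLDER = "$%^&"   ' same placeholder string as the original script

Dim fs, txt, contents
If WScript.Arguments.Count < 2 Then
    WScript.Echo "Usage: cscript Unwrap.vbs <input.txt> <output.txt>"
    WScript.Quit 1
End If

Set fs = CreateObject("Scripting.FileSystemObject")
Set txt = fs.OpenTextFile(WScript.Arguments(0), ForReading)
contents = txt.ReadAll
txt.Close

' Protect paragraph breaks, remove single line breaks, restore paragraph breaks
contents = Replace(contents, vbCrLf & vbCrLf, PLACEHOLDER)
contents = Replace(contents, vbCrLf, "")
contents = Replace(contents, PLACEHOLDER, vbCrLf & vbCrLf)

' Third argument True: create the output file if it does not already exist
Set txt = fs.OpenTextFile(WScript.Arguments(1), ForWriting, True)
txt.Write contents
txt.Close
WScript.Echo "Done!"
```

Running it with cscript rather than double-clicking prints the messages to the console instead of showing a pop-up.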
I've found this useful, as for the most part it saved a lot of effort in editing a 300-page PDF that I needed to convert to a Word document.
Again, if anyone has a better solution please let me know!