Extract text content from PDF

Question

I have been extracting text from PDFs using pdftotext, and I have also done this with Ghostscript. Recently, a utility provider changed their PDFs so that a portion of the text is no longer extracted by either method. Specifically, I'm missing the due date and total due. When I open the PDF in a reader, the 'missing' text can be highlighted, copied, and pasted into an external editor. When I open it in Acrobat Pro and view the content (View -> Show/Hide -> Navigation Panes -> Content), the text I need is there. How can I get it out without manually copying and pasting? That is not an option, because I'll be doing this on thousands of PDFs.

Here is an example of what I'm dealing with. I have removed all sensitive data:

link to PDF

EDIT: I noticed after posting this that when you follow the link to the file (hosted on Google Drive), the viewer lets you select and copy most of the text on the page, but not the text I'm missing. When you download the file, you can select the missing text in a PDF reader.

score 2 · Answer 1 · answered Feb 20 '13 at 17:29

Recent releases of Ghostscript have a txtwrite device which is probably worth trying.
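Since the question involves thousands of PDFs, here is a minimal sketch of how the txtwrite device might be driven from a script. It assumes a Ghostscript executable named `gs` is on the PATH (on Windows it is usually `gswin32c` or `gswin64c`) and that txtwrite emits UTF-8, which I believe is the default for recent releases. The function names and directory layout are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_txtwrite_command(pdf_path, gs_executable="gs"):
    """Return the argument list for a txtwrite run that writes to stdout."""
    return [
        gs_executable,        # assumption: executable name, adjust per platform
        "-q",                 # suppress the startup banner so stdout holds only text
        "-dNOPAUSE",          # do not prompt between pages
        "-dBATCH",            # exit after the last input file
        "-sDEVICE=txtwrite",  # the text-extraction device
        "-sOutputFile=-",     # "-" sends the device output to stdout
        str(pdf_path),
    ]


def extract_text(pdf_path, gs_executable="gs"):
    """Run Ghostscript on one PDF and return the extracted text."""
    result = subprocess.run(
        build_txtwrite_command(pdf_path, gs_executable),
        capture_output=True,
        check=True,  # raise CalledProcessError if Ghostscript fails
    )
    # Assumption: output is UTF-8; undecodable bytes are replaced, not fatal.
    return result.stdout.decode("utf-8", errors="replace")


def extract_directory(pdf_dir, out_dir, gs_executable="gs"):
    """Write one .txt file per PDF found in pdf_dir; return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        target = out_dir / (pdf_path.stem + ".txt")
        target.write_text(extract_text(pdf_path, gs_executable), encoding="utf-8")
        written.append(target)
    return written
```

Something like `extract_directory("bills", "bills_txt")` would then produce one text file per bill. Whether the missing fields show up still depends on the Ghostscript version, as the comments below discuss.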

chrisl

I have tried the txtwrite device and it gives me the same results as pdftotext — still missing the due date and account total. – Ben Walker Feb 20 '13 at 17:31
What exactly is missing? I assumed it was the "Nov 12, 2012 - Dec 12, 2012", but I see that in the output from txtwrite. – chrisl Feb 20 '13 at 17:51
I'm missing the top right corner: "Please Pay By Dec 28, 2012" and "Total Due $1,839.42" – Ben Walker Feb 20 '13 at 18:05
That text is a Type 3 font, which evince/poppler doesn't render (Ghostscript does render it, but txtwrite doesn't seem to see it, which is weird). I don't have time to look at it in detail, and the engineer who looks after txtwrite is on holiday, but if you raise a bug, and attach the file ( http://bugs.ghostscript.com/ ), you should at least get an explanation of why it doesn't work - at some point. – chrisl Feb 20 '13 at 18:27
Is there a way to maybe convert all fonts to some basic font before running it through txtwrite? – Ben Walker Feb 20 '13 at 18:45
Hmm, no. As far as I can tell, the fact that the errant text is in a Type 3 font *shouldn't* affect txtwrite, but at a quick glance, that's the only obvious difference between the text that works and the text that doesn't. To really be sure, someone needs to pick apart the PDF to see the actual structure of the file, and follow the text through from the PDF interpreter to txtwrite. As I said, I don't have time to do that just now, and even if I did, I'm not the right person to do it. But raising a bug would mean it was seen by the right engineer. – chrisl Feb 21 '13 at 07:33
The lack of text output was caused by some serious logic bugs in the txtwrite device when it was unable to extract Unicode information from the PDF file and fell back all the way to the original character codes. This has now been fixed. I would ask that if you find a bug in Ghostscript, *please* report it to us so we can fix it. The txtwrite device correctly extracts all the text in this document now. – KenS Feb 25 '13 at 16:40
@KenS I have a similar issue here: https://stackoverflow.com/questions/49012170/how-to-add-a-separator-after-each-word-with-ghostscript-sdevice-txtwrite . In my case, I feel that an extra switch defining a separator character to be added after each word might make table parsing easier. – Charles Okwuagwu Feb 27 '18 at 15:25
I've answered your question, I'm afraid that there are no 'words' in a PDF file, so this simply isn't possible. – KenS Feb 27 '18 at 17:14

score 1 · Accepted Answer · answered Mar 01 '13 at 18:19

I have solved this by getting the newest unreleased version of Ghostscript from git and building it. Now the txtwrite device gives me exactly what I need. Thanks to chrisl for his answer and comments leading me in the right direction.
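When running this over thousands of files, it may help to check automatically that the two fields actually made it into the extracted text. Below is a rough sketch that looks for the labels quoted in the comments above ("Please Pay By" and "Total Due"). The regular expressions and the function name are assumptions for illustration: they presume each value follows its label on the same line, which may not match how txtwrite lays out every bill.

```python
import re

# Hypothetical patterns, modeled on "Please Pay By Dec 28, 2012"
# and "Total Due $1,839.42" from the sample bill.
DUE_DATE_RE = re.compile(r"Please Pay By\s+([A-Z][a-z]{2} \d{1,2}, \d{4})")
TOTAL_DUE_RE = re.compile(r"Total Due\s+\$([\d,]+\.\d{2})")


def find_bill_fields(text):
    """Return (due_date, total_due) as strings; a missing field is None."""
    due = DUE_DATE_RE.search(text)
    total = TOTAL_DUE_RE.search(text)
    return (
        due.group(1) if due else None,
        # Drop thousands separators so the amount parses as a number later.
        total.group(1).replace(",", "") if total else None,
    )
```

Any file for which either value comes back as `None` can be set aside for a closer look instead of silently producing incomplete data.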

Ben Walker

Hello, have you ever tried to remove images from a PDF so that the PDF consists only of the text? I am searching for a way to do that. Do you have any solution using Ghostscript or any other CLI tool? Kindly help. – hussainb Dec 19 '13 at 09:55

Fred F · Answer 3 · 2013-02-22T21:10:54.380

There is a VERY HACKY method to extract the data, but it only works with older versions of Ghostscript, such as 8.51 or 8.62. In those versions, the PDF operators are defined in PostScript in /lib/pdf_ops.ps. Newer versions do something else.

Version 8.62, which I tested, is available here:

http://sourceforge.net/projects/ghostscript/files/GPL%20Ghostscript/8.62/gs862w32.exe/download

The text you are after is shown by the /Tj and /TJ operators. Adding dup == to the beginning of each definition prints the operand (the string, or the array of strings and kerning offsets) to standard output before it is shown. (This could be made more sophisticated.) I also didn't bother with the font warning messages, but these could be filtered out if the data were written to a file.

Some words are split into pieces and individual letters because kerning is being done. Given time, this could also be filtered.

modified /Tj from pdf_ops.ps

/Tj { dup == 0 0 moveto Show settextposition } bdef

modified /TJ from pdf_ops.ps

/TJ { dup ==   % added: print the operand array before showing it
  0 0 moveto {
    dup type /stringtype eq {
      Show
    } { -1000 div
      currentfont /ScaleMatrix .knownget { 0 get mul } if
      0 Vexch rmoveto
    } ifelse
  } forall settextposition
} bdef

output

(Help a neighbor within your county each month by contributing to The Salvation )
(Army's Project SHARE and Georgia Power will match your gift. To help, simply check )
($1, $2, $5, or $10 on the return portion of this bill. Starting next month, your pledge )
(amount will be included on your monthly bill.)
(Our business offices will be closed on December 24 and 25 for Christmas and January )
(1 for New Year's Day. In case of an emergency, please call us at the number on your )
(bill 24 hours a day, 7 days a week.)
(PLEASE KEEP THIS PORTION FOR YOUR RECORDS.)
(PLEASE RETURN THIS PORTION WITH YOUR PAYMENT, MAKING SURE THE RETURN ADDRESS SHOWS IN THE ENVELOPE WINDOW.)
(Account Number)
(Mail To:)

Isn't postscript fun?
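The kerning fragments mentioned above could be stitched back together with a small post-processing step. This is only a sketch, assuming the `==` output looks like `(string)` for Tj and `[(Acc) -15 (ount)]` for TJ, with standard PostScript string escapes. The function names are made up. It also assumes the bytes are readable text: for fonts with custom encodings you would get raw character codes, which this does not attempt to map.

```python
# Single-character PostScript string escapes; any other escaped
# character (such as "(", ")" or "\") stands for itself.
ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "b": "\b", "f": "\f"}


def parse_ps_strings(line):
    """Return the string literals found in one line of `==` output, in order."""
    strings = []
    i, n = 0, len(line)
    while i < n:
        if line[i] != "(":
            i += 1  # skip numbers, brackets, and whitespace
            continue
        depth, i, chars = 1, i + 1, []
        while i < n and depth > 0:
            c = line[i]
            if c == "\\" and i + 1 < n:
                nxt = line[i + 1]
                if nxt in "01234567":
                    # Octal escape: one to three digits.
                    j = i + 1
                    while j < n and j < i + 4 and line[j] in "01234567":
                        j += 1
                    chars.append(chr(int(line[i + 1:j], 8) & 0xFF))
                    i = j
                    continue
                chars.append(ESCAPES.get(nxt, nxt))
                i += 2
                continue
            if c == "(":
                depth += 1  # balanced parentheses may nest unescaped
            elif c == ")":
                depth -= 1
                if depth == 0:
                    i += 1
                    break
            chars.append(c)
            i += 1
        strings.append("".join(chars))
    return strings


def join_fragments(lines):
    """Join the pieces of each Tj/TJ operand line into one string per line."""
    joined = []
    for line in lines:
        stripped = line.strip()
        if not stripped.startswith(("(", "[")):
            continue  # skip font warnings and other non-operand lines
        joined.append("".join(parse_ps_strings(stripped)))
    return joined
```

Feeding it the captured Ghostscript output line by line gives one string per text-showing call. The numbers inside TJ arrays are simply dropped, so a large offset that was meant to act as a word space would be lost. Handling that would need a threshold chosen by looking at real data.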

After trying this, I get "Can't find initialization file gs_init.ps" when trying to run Ghostscript. I'm using 8.62. Also, my pdf_ops.ps was in lib\, not bin\. I assumed it should just stay in lib\. – Ben Walker, Feb 22 '13 at 19:03
Very strange. gs_init.ps is read before ever getting to pdf_ops.ps, so I suspect that is an unrelated issue. Try running GS without the modification and see if that error goes away. The gs_init.ps gets read before `GPL Ghostscript 8.62` (2008-02-29) is printed; do you see that message? The pdf_ops.ps gets read after `This software comes with NO WARRANTY: see the file PUBLIC for details.` If the error happens before this message, there is definitely something else happening. Yes, the directory should be \lib and not \bin, and the file should remain in \lib. – Fred F, Feb 22 '13 at 20:51
