Extract text from PDF files (Printed)

Question

I'm using RedMon (Redirection Port Monitor), the HP Universal Driver PS and Ghostscript to intercept document printing.

However, for the following scenario:

PDF file -> HP Universal Driver PS -> RedMon -> PostScript file** -> Ghostscript creates printed.pdf*

* I cannot extract text from this PDF file: gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=txtwrite -sOutputFile=output.txt printed.pdf

** The PostScript file is created in compacted form, and I cannot extract the text from it.

My question:

Can I create a PostScript file without compacting when a PDF is sent to the printer?

Observation: printed.pdf -> image (TIFF) -> Tesseract (OCR) -> text file works, but it is slow.

If I understand correctly you have a pdf file and you want it converted to text. You might take a look at this https://stackoverflow.com/questions/6187250/pdf-text-extraction — Dweeberly, Jun 14 '18 at 01:35
I've tried every form of the link you submitted. However the PDF-> PS-> PDF-> Text scenario does not work — Marc.Adans, Jun 19 '18 at 20:17

score 0 · Answer 1 · answered Jun 14 '18 at 07:22

As Dweeberly says in the comments, if you want to extract text from a PDF file, do not start by printing it. Especially do not turn it into PostScript.

PDF files can optionally contain ToUnicode CMaps, and these allow reliable text extraction. PostScript doesn't support them, so the information is lost if you create a PostScript file from the PDF (no matter what means you use to create the PostScript).

In addition, the PostScript program will usually be created with subset fonts, non-standard Encodings and other modifications to the text which will make it hard, or impossible, to extract text from it.

Since Ghostscript can accept PostScript and PDF as input, there is no value in turning the PDF into PostScript before feeding it to the txtwrite device. All you are doing is making life harder for the device and discarding useful information.

Just use Ghostscript and the txtwrite device, and give it the PDF file as an input.
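
A minimal Python sketch of that direct route, for anyone scripting it. It reuses the switches from the question unchanged and only swaps the input for the original PDF. The function names, the file names and the default executable name `gs` are assumptions for illustration (on Windows the console executable is usually named differently, so pass it in explicitly).

```python
import subprocess
from pathlib import Path


def txtwrite_command(pdf_path, txt_path, gs_executable="gs"):
    """Build the Ghostscript argument list that runs txtwrite on a PDF.

    gs_executable is an assumption; adjust it to match your installation.
    """
    return [
        gs_executable,
        "-dSAFER",
        "-dBATCH",
        "-dNOPAUSE",
        "-sDEVICE=txtwrite",
        f"-sOutputFile={txt_path}",
        str(pdf_path),
    ]


def extract_text(pdf_path, txt_path, gs_executable="gs"):
    """Run Ghostscript on the original PDF and return the extracted text."""
    command = txtwrite_command(pdf_path, txt_path, gs_executable)
    # check=True raises CalledProcessError if Ghostscript exits with an error.
    subprocess.run(command, check=True)
    # Undecodable bytes are replaced rather than raising an exception.
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")
```

Calling `extract_text("original.pdf", "output.txt")` (hypothetical file names) writes output.txt next to the script and returns its contents.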

Naturally OCR works, because it scans the shapes of the text to determine the characters, but yes, it's slow. On the other hand, it will work with PDF files which only contain images of text, not actual text, which the txtwrite device won't.
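
Given that trade-off, one possible arrangement is to try txtwrite first and fall back to the slow OCR path only when the output looks empty. The sketch below shows just the decision step in Python. The function name `needs_ocr` and the threshold of 20 characters are hypothetical choices, not something Ghostscript or Tesseract provides.

```python
def needs_ocr(extracted_text, min_alnum_chars=20):
    """Guess whether txtwrite output is too sparse to be a real text layer.

    The threshold is an arbitrary assumption; tune it for your documents.
    """
    # Count letters and digits only, so whitespace and form feeds are ignored.
    alnum_count = sum(1 for ch in extracted_text if ch.isalnum())
    return alnum_count < min_alnum_chars
```

A page that is only a scanned image typically yields little more than whitespace from txtwrite, so a check like this would route it to OCR while leaving ordinary text PDFs on the fast path.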

answered Jun 14 '18 at 07:22

KenS

Unfortunately, I need to intercept the printing of old software (no source code available), and it prints either plain text or PDFs. – Marc.Adans Jun 19 '18 at 20:21
So you are saying that the software prints to a printer, and you are intercepting that? If so, where does the PDF come in? It might help to see an example of what you are feeding to Ghostscript. – KenS Jun 19 '18 at 20:33
I'm using RedMon(Redirection Port Monitor), HP Universal Driver PS and GhostScript to intercept document printing. – Marc.Adans Jun 20 '18 at 11:27
Try using the generic PostScript printer (if it still exists in your version of Windows) instead of the HP printer. The Microsoft PostScript driver does special extra stuff to get usable text out, and the txtwrite device is capable of using that additional information. Failing that, try any other PostScript printer but **not** an HP one. I'm still unclear on where the PDF file comes into this. – KenS Jun 20 '18 at 16:54
Agree with @KenS: use the Microsoft driver. I often go to the "Generic" drivers section in the Windows driver picker dialog and then pick the MS Publisher ImageSetter driver. That works well and gives good text extraction once the PS has been converted to PDF (although I don't use the Ghostscript txtwrite device, but instead use a PDF library that lets me analyse the blocks of text within a PDF). – Ian Yates Mar 16 '21 at 11:57
