15

I am using Ghostscript to convert a source PDF file into an array of PNG images. Before I convert a PDF page into a PNG image, I need to remove all text from the PDF, so that the converted page image contains every other element but no text.

Can I achieve this with Ghostscript or will I need to look into different tools?

I would also be interested in a tool that can read my source PDF and re-save it with all the text removed.

Primoz Rome

3 Answers

24

Since my previous answer, development has continued, and a new option is available now, which justifies a new answer.

The most recent versions of Ghostscript support 3 new parameters, which allow you to remove either all TEXT, or all IMAGE or all VECTOR elements from a PDF.

To remove all TEXT elements from an input PDF, run

gs -o no-more-texts.pdf -sDEVICE=pdfwrite -dFILTERTEXT   input.pdf

To remove all raster IMAGE elements from an input PDF, run

gs -o no-more-images.pdf -sDEVICE=pdfwrite -dFILTERIMAGE  input.pdf

To remove all VECTOR elements from an input PDF, run

gs -o no-more-vectors.pdf -sDEVICE=pdfwrite -dFILTERVECTOR input.pdf

Of course, you can also combine any two of the above parameters (combining all three will create empty pages).
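Since the question is about PNG output, here is a minimal sketch that applies the same flags while rendering instead of writing a new PDF. It assumes the filter flags are honoured by raster devices in the same way as by pdfwrite; the helper names `gs_filter_flags` and `render_png_without`, the `png16m` device and the `-r150` resolution are illustrative choices, not part of the answer above.

```shell
#!/bin/sh
# Hypothetical helper: map element kinds to the Ghostscript filter flags
# named above. Prints one flag per kind and rejects unknown kinds.
gs_filter_flags() {
    for kind in "$@"; do
        case "$kind" in
            text)   printf '%s\n' -dFILTERTEXT ;;
            image)  printf '%s\n' -dFILTERIMAGE ;;
            vector) printf '%s\n' -dFILTERVECTOR ;;
            *)      echo "unknown element kind: $kind" >&2; return 1 ;;
        esac
    done
}

# Hypothetical helper: render every page of input $1 to PNG files named by
# the pattern $2, dropping the element kinds given as the remaining arguments.
# png16m and -r150 are assumptions made for this sketch.
render_png_without() {
    input=$1; pattern=$2; shift 2
    flags=$(gs_filter_flags "$@") || return 1
    # $flags is left unquoted on purpose so each flag becomes its own argument.
    gs -o "$pattern" -sDEVICE=png16m -r150 $flags "$input"
}
```

Calling `render_png_without input.pdf 'page-%03d.png' text` would then write one text-free PNG per page, provided the installed Ghostscript version supports the filter flags.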

Here are screenshots of a PDF page, where the original contained all three elements whereas the resulting pages look different.


Screenshot of original PDF page containing "image", "vector" and "text" elements.


Running the following 6 commands will create all 6 possible variations of remaining contents:

 gs -o noIMG.pdf   -sDEVICE=pdfwrite -dFILTERIMAGE                input.pdf
 gs -o noTXT.pdf   -sDEVICE=pdfwrite -dFILTERTEXT                 input.pdf
 gs -o noVCT.pdf   -sDEVICE=pdfwrite -dFILTERVECTOR               input.pdf

 gs -o onlyIMG.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERTEXT  input.pdf
 gs -o onlyTXT.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE input.pdf
 gs -o onlyVCT.pdf -sDEVICE=pdfwrite -dFILTERIMAGE  -dFILTERTEXT  input.pdf

The following image illustrates the results:


Top row, from left: all "text" removed; all "images" removed; all "vectors" removed. Bottom row, from left: only "text" kept; only "images" kept; only "vectors" kept.


Kurt Pfeifle
10

You can achieve what you want without Ghostscript, simply by using a text editor.

  1. Convert your compressed PDF into one which has (nearly) all PDF objects' contents and streams expanded into a readable form using QPDF:

     qpdf --qdf --object-streams=disable input.pdf editable.pdf
    
  2. Open your new editable.pdf file with a text editor (which also gracefully handles any remaining binary blobs inside the PDF such as font or ICC resources).

  3. Search for all occurrences of the TJ and Tj strings (PDF operators used to show text) inside PDF object streams and replace them with JT and jT respectively (undefined, nonsense PDF operators). Save the file as edited.pdf.

  4. Now convert your edited.pdf to your PNG images as needed.

Note that edited.pdf will still display in most PDF viewers, but the text will be missing as intended. However, it will be easy to restore the text again, by restoring the original TJ/Tj operators and thus reversing any manual modification.
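The search-and-replace in step 3 could be scripted roughly as follows. This is a sketch under assumptions, not a robust tool: it assumes that `BT` and `ET` each sit alone on a line and that every `Tj` or `TJ` ends its line, which will not hold for every file. It does not touch the `'` and `"` text-showing operators. The function name `disable_text_ops` is made up for this example. Because each replacement has the same length as the operator it replaces, stream lengths and byte offsets in the file stay unchanged.

```shell
#!/bin/sh
# Hypothetical helper: read a QDF-style PDF on stdin and write it to stdout
# with Tj and TJ renamed to jT and JT, but only between BT and ET lines.
# LC_ALL=C makes sed treat the input as plain bytes; binary blobs may still
# upset some sed implementations.
disable_text_ops() {
    LC_ALL=C sed -e '/^BT$/,/^ET$/{' -e 's/Tj$/jT/' -e 's/TJ$/JT/' -e '}'
}
```

A typical call would be `disable_text_ops < editable.pdf > edited.pdf`. Restricting the edit to the lines between `BT` and `ET` is meant to keep it away from other data that happens to contain the same two letters.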


In the "normalized" form created by the qpdf command given above, objects with streams usually look like this (where NNN is an integer number):

NNN 0 obj
<<
   % Here are the key:value pairs of the object dictionary
   /Key1 somevalue1
   /Key2 somevalue2
   % ... (more key:value pairs)
>>
stream
% Here is the content of the object stream
endstream
endobj

An "image stream" has basically the same structure. But the key:value pairs typically contain the following four entries, in any order (where NNN and MMM are integer values giving width and height of the image in pixels):

/Type /XObject
/Subtype /Image
/Width NNN
/Height MMM
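To see how many image objects a file contains before editing it, a quick count of the `/Subtype /Image` entries may help. This is only a sketch: it assumes the QDF layout shown above, with each key on its own line and a single space between key and value, and it relies on the `-a` option of `grep`, which GNU and BSD grep provide. The function name `count_image_objects` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the number of lines in file $1 that contain
# an image XObject marker. -a makes grep treat a file with binary parts as text.
count_image_objects() {
    LC_ALL=C grep -a -c '/Subtype /Image' "$1"
}
```

Running `count_image_objects editable.pdf` prints the count. A non-zero result is a hint that the file has image streams which a search-and-replace should leave alone.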

Update/Correction

My bad! My original answer contained a repeated typo. I had used tj at places where Tj should have been used. Sorry for any confusion that may have created.

Semnodime
Kurt Pfeifle
  • Actually - that only worked for one file, while producing garbled output on others. Changing `TJ`s to `JT`s (or any combination) had the same result on these files as well - the output would just garble at some point. What I ended up doing is finding all occurrences of `\nBT\n` and `\nET\n` and removing everything between them. – eithed Dec 18 '15 at 14:20
  • 1
    @eithedog: If I can't look at the file itself, I can't analyze why you encounter the behavior you observe. The only thing which (from the top of my head) could have an influence are the **`'`** and **`"`** operators: they are also used to *"show text"*, similar to **`Tj`** and **`TJ`** (but with some additional twists, like automatically move to the next line, or setting word distances). – Kurt Pfeifle Dec 18 '15 at 15:04
  • I understand that and appreciate the help. Might it be that `tj` can be actually encountered within the image streams and that's why altering them would garble the outputted pdf? As I mentioned - in the end I've just removed everything between `BT` and `ET` and that seemed to do the trick. I assumed that was the decoded text stream with all the transformations - as it contained the `tj`s as well - for example: `Td[(C)7(arr)3(ot C)7(ak)8(e......Ł2)]TJ`, but this as well: `Tm (DRINKS)Tj` – eithed Dec 18 '15 at 16:07
  • 1
    @eithedog: *"Might it be that tj can be actually encountered within the image streams and that's why..."*. Yes. Be careful where you change the **`TJ`** and **`Tj`** strings: only inside *"PDF object streams"* (as I say in my answer), never *globally* across all of the PDF file (where it may match an image stream)... – Kurt Pfeifle Dec 18 '15 at 16:58
1

Obviously this is not a standard requirement, but it was recently discussed on the #Ghostscript forum on IRC. The channel is logged and you can find the discussion here:

http://ghostscript.com/irclogs/2014/05/21.html

We originally suggested changing the initial text rendering mode to 3 in pdf_ops.ps, but that had no effect on the file as it was using a type 3 font. So we suggested instead altering the definitions of TJ and Tj in the same file. Look at around 15:37 in the log.
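As a rough sketch of how a modified copy of the init files might be used: the `-I` switch tells Ghostscript to search a given directory first. The function name `render_with_modified_init`, the `png16m` device and the `-r150` resolution are assumptions made for this example, and the directory argument stands for wherever the edited copy of pdf_ops.ps has been placed.

```shell
#!/bin/sh
# Hypothetical helper: render input $2 to PNG files named by pattern $3,
# asking Ghostscript to look in directory $1 first for its support files
# (for example a copy of Resource/Init with an edited pdf_ops.ps).
render_with_modified_init() {
    init_dir=$1; input=$2; pattern=$3
    gs -I"$init_dir" -o "$pattern" -sDEVICE=png16m -r150 "$input"
}
```

Whether the edited file is actually picked up depends on how the installed Ghostscript was built and which version it is, so it is worth checking the result on a single page first.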

KenS
  • In pdf_ops.ps, alter the definitions of /TJ and /Tj: in each case, replace 'Show' with 'pop'. Depending on your operating system and how Ghostscript has been built, you may need to rebuild Ghostscript, or include the directory containing the modified files by putting -I on the command line – KenS Jun 20 '14 at 14:00
  • Can I do this if I have already installed GS on OS X? I can't find `pdf_ops.ps` on the HDD. I have now also downloaded the GS source and found this file and the /TJ, /Tj definitions. I guess I need to rebuild it when I change these? And what is the command I need to run to remove text from the PDF file after I make these /TJ, /Tj changes? – Primoz Rome Jun 30 '14 at 13:15
  • Ghostscript can be built in many ways... If you build with COMPILE_INITS=1 then the support files are built into the executable. If you build with COMPILE_INITS=0, then they are on disk. In either case you can use the -I switch (include) to tell Ghostscript to look in a directory, or list of directories, for files *first*. So you can put a modified gs/Resource/Init somewhere, change pdf_ops.ps and then tell GS to use that directory. You then use the pdfwrite device to make a *new* PDF file (it leaves the original untouched); because the text operators are no-ops, the new file has no text. – KenS Jun 30 '14 at 13:49
  • Oops, since you are rendering to PNG, just use whatever command line you are already using. Again, because the TJ and Tj operators are no-ops, the text won't be rendered. – KenS Jun 30 '14 at 13:50
  • Ok thanks, hopefully I can make this work! I have never built GS myself, I just used the OS X installer to install it on the system. I will try the -I switch to point to the modified resource files. – Primoz Rome Jun 30 '14 at 13:53