1

Is it possible to create a TIFF file from a PostScript file (created from a PDF document with readable text and images) so that the TIFF contains only the text and none of the images?

For example, by adding something like a maxbuffer setting so that images are removed and only the text remains?

And if boxes and lines around text could be removed as well that would be awesome.

Best regards!

Joe

2 Answers

3

You can redefine the various 'image' operators so that they don't do anything:

/image {
 type /dicttype eq not { % 'type' uses up the top operand, the only one in the dict form
   pop pop pop pop   % remove the remaining operands of the non-dictionary form
 } if
} bind def

/imagemask {
 type /dicttype eq not { % 'type' uses up the top operand, the only one in the dict form
   pop pop pop pop   % remove the remaining operands of the non-dictionary form
 } if
} bind def

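/colorimage {
  % operands: width height bits/comp matrix src0 [... srcN-1] multi ncomp
  exch                   % bring 'multi' to the top, leaving ncomp below it
  { { pop } repeat }     % multi true: one data source per colour component
  { pop pop }            % multi false: discard ncomp and the single data source
  ifelse
  pop pop pop pop        % matrix, bits/comp, height, width
} bind def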

Save that as a file, and add the file to your GS invocation, ahead of the PostScript file you want to render.
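
As a rough sketch of that step, the script below writes a small prelude file and passes it to Ghostscript before the document. It assumes a Unix-like shell with the executable named `gs` (on Windows it would be `gswin32c` or similar); `noimage.ps`, `input.ps` and `out.tif` are placeholder names, and the `-r300` resolution is an arbitrary choice. Only the `image` redefinition is written here; the other redefinitions would go into the same file.

```shell
#!/bin/sh
# Sketch only: file names are placeholders, and 'gs' is assumed to be on PATH.

# Write the prelude. The quoted 'EOF' stops the shell from expanding anything.
cat > noimage.ps <<'EOF'
% Make 'image' discard its operands instead of painting.
/image {
  type /dicttype eq not { pop pop pop pop } if
} bind def
EOF

# Ghostscript processes its input files in order, so the prelude goes first.
# -o names the output file and also implies -dBATCH -dNOPAUSE.
if command -v gs >/dev/null 2>&1 && [ -f input.ps ]; then
  gs -sDEVICE=tiff32nc -r300 -o out.tif noimage.ps input.ps
else
  echo "gs or input.ps not found; skipping the render step"
fi
```

The order of the two input files matters: if `input.ps` came first, the page would already be rendered before the redefinitions took effect.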

You can remove linework similarly by redefining the stroke operator:

/stroke {
  newpath
} bind def

rectstroke is harder, because it accepts several different operand forms. I suggest you read the PLRM (PostScript Language Reference Manual) if you need that one.
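
For what it's worth, here is one possible sketch of a `rectstroke` replacement, appended to the prelude file from the shell. It is not from the answer above; it assumes the operand forms listed in the PLRM (four numbers, or a single number array or encoded number string, each optionally followed by a 6-element matrix) and it has not been tested against real documents. `noimage.ps` is a placeholder name.

```shell
#!/bin/sh
# Sketch only: appends a hypothetical rectstroke replacement to noimage.ps.

cat >> noimage.ps <<'EOF'
/rectstroke {
  % Drop an optional trailing matrix: a 6-element array. A number array
  % holds a multiple of 4 values, so a length of 6 can only be a matrix.
  dup type dup /arraytype eq exch /packedarraytype eq or
  { dup length 6 eq { pop } if } if
  % What is left is either four numbers or one number array/string.
  dup type dup /integertype eq exch /realtype eq or
  { pop pop pop pop } { pop } ifelse
} bind def
EOF
```

Unlike `stroke`, `rectstroke` does not use the current path, so the replacement only has to clear the operand stack.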

Possibly also the fill operator:

/fill {
  newpath
} bind def

/eofill {
  newpath
} bind def

Beware! Some text is not drawn using the text 'show' operators, but is constructed from linework or drawn as images. Such text will disappear along with the images and linework if you redefine the operators as shown above.

Note that the PDF interpreter often doesn't allow re-definition of operators, so you may first have to convert your PDF file to PostScript, using the ps2write device, then run the resulting file through GS to get a TIFF file.
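
A minimal sketch of that two-step route, assuming a Unix-like shell with `gs` on the PATH and a prelude file with the redefinitions already saved as `noimage.ps`. All file names are placeholders.

```shell
#!/bin/sh
# Sketch only: PDF -> PostScript -> TIFF. File names are placeholders.

pdf=input.pdf
ps=input.ps

if command -v gs >/dev/null 2>&1 && [ -f "$pdf" ] && [ -f noimage.ps ]; then
  # Step 1: convert the PDF to PostScript with the ps2write device.
  gs -sDEVICE=ps2write -o "$ps" "$pdf"
  # Step 2: render the PostScript to TIFF, loading the redefinitions first.
  gs -sDEVICE=tiff32nc -o out.tif noimage.ps "$ps"
else
  echo "gs, $pdf or noimage.ps is missing; nothing was converted"
fi
```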

atamanroman
KenS
  • Hi thanks for the answer, but how will I invoke the file? Could you give me a command line example for using your file together with the postscript file to produce a tiff file? – Joe Jun 23 '11 at 07:29
  • 1
    If you currently use "gswin32 -sDEVICE=tiff32nc -sOutputFile=out.tif input.ps", then create a file called, for example, noimage.ps with the content above and invoke it like this: "gswin32 -sDEVICE=tiff32nc -sOutputFile=out.tif noimage.ps input.ps" – KenS Jun 23 '11 at 10:59
  • I've tried this, but for some reason it doesn't seem to work. I've created a file called noimage.ps with the code you provided. Is there anything else I need to fill in, like stdout or something like that? – Joe Jun 27 '11 at 14:49
  • Could be my fault, I forgot to deal with 'inline' image data. Do you get an error message? See also: http://stackoverflow.com/questions/6466990/remove-images-from-pdf – KenS Jun 28 '11 at 07:49
  • I've thought of a wild wacky way to eat the inline image data. Wrap it in a type3 font to divert rendering into the font cache. ... Wait, nevermind, that's stupid. Just draw it off the page. – luser droog Mar 31 '13 at 21:34
1
gs -sDEVICE=bitrgbtags -o out.tags <myfile>

This will create a PPM file with tags; the tags label each pixel as text, vector, image, etc.

Then you can use the C programs in ghostpdl/tools/GOT to process the image. It sounds like you want to write a new C program to set each non-text pixel to the background color, or maybe just white. This is fairly straightforward with the example C programs in the GOT subdirectory as a guide (if you are a programmer). Then you would convert the PPM to TIFF. Ken provided a different way of doing this that doesn't require pixel processing.

Kurt Pfeifle
Henry
  • I tried your bitrgbtags, but it didn't work with gswin32c. Do you have an example command line for me to use? – Joe Jun 23 '11 at 07:38
  • I'm afraid that I don't think the device is built into the Windows release, at least not as standard. If you can build GS from source you can add it yourself. – KenS Jun 23 '11 at 11:05