2

I'm looking for a way to remove ALL text from a PDF. I've been using iTextSharp for a while now, and extracting text from a PDF with it is easy (without the use of OCR). However, I can't find an option to delete the text.

This solution frankly doesn't work for me.

    page.GetAsArray(PdfName.CONTENTS);

returns null for me, also when using PdfName.Text and some others I've tried.

The library to use doesn't really matter; I just think iTextSharp should be able to do this. However, if there is another (free) solution, bring it on.

EDIT: Just to make clear why I want to remove all text from the PDFs:

I want to reduce the size of the PDFs. I do this by reducing the resolution of the images in the PDF. However, in a lot of cases the vector images take up most of the space. So I thought of the following: remove all text, then convert the remaining PDF (with only the images and vectors) to a bitmap (JPEG). After that I paste the text over it again. Another option would be to make the text invisible, but I don't think that is any easier.

Community
Chumbawamba
  • Just to clarify, you are trying to remove the text from the pdf, but leave the image intact? – Steve Czetty Oct 01 '12 at 14:21
  • To clarify even more: you want to remove all traces of recognizable text, so in its place is white area? Or you want to convert text consisting of fonts into small raster images so that copy'n'pasting the same text doesn't work any more (but reading it still works)? Or alternatively, convert the complete PDF page into one raster image (instead of a collection of vector objects) so copy'n'paste no longer works? – Kurt Pfeifle Oct 01 '12 at 14:34
  • My goal is to completely remove all text (that is not bitmap) from the PDF and leave the rest as it is. – Chumbawamba Oct 01 '12 at 17:03

3 Answers

2
  1. The /Contents of a page dictionary doesn't always consist of an array. It should be evident that GetAsArray() returns null if the content is stored as a stream.
  2. Suppose you use GetAsStream() and you remove all the text contents from the stream, then you may still have text content in XObjects. That text won't be referenced from a content stream, but iText won't be able to remove the XObjects as 'unused objects' because the objects will still be referenced from the /Resources in the page dictionary.

Please read ISO-32000-1 to find out what you're doing wrong.
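
To illustrate point 1: ISO 32000-1 treats an array of content streams as if the streams were concatenated into one, with splits allowed only between tokens. The sketch below is a hedged illustration in Python (picked for brevity, it does not use iTextSharp). `join_content_streams` is a hypothetical helper, and the byte strings stand in for stream data that a PDF library has already decoded.

```python
from typing import List, Union


def join_content_streams(contents: Union[bytes, List[bytes]]) -> bytes:
    """Return a single byte string for a page's /Contents entry.

    `contents` is a stand-in for already-decoded data: either one
    stream (bytes) or an array of streams (list of bytes).
    """
    if isinstance(contents, (bytes, bytearray)):
        # /Contents is a single stream: nothing to join.
        return bytes(contents)
    # /Contents is an array: an operator and its operands may sit in
    # different streams, so join with whitespace before parsing.
    return b"\n".join(bytes(part) for part in contents)
```

Code that only handles the array case (as `GetAsArray()` does) gets nothing back for single-stream pages, which matches the `null` reported in the question.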

Kurt Pfeifle
Bruno Lowagie
  • `GetAsStream()` indeed returns something. But if I delete everything from the stream, the whole page is blank and the images are removed as well. How can I delete ONLY the text from the stream? Thank you – Chumbawamba Oct 02 '12 at 11:31
  • You need to parse the PDF syntax, throwing away all the text operators, and keeping all the graphics state operators. For every 'Do' operator, you need to check if you're dealing with a Form XObject or an Image XObject. You have to keep all the Image XObjects, and examine all the Form XObjects (again throwing away text, and keeping graphics state and images). If you hire somebody to do this, count on paying 2 to 3 days of work. – Bruno Lowagie Oct 02 '12 at 11:47
  • Sorry, but I'm not hiring somebody to do this for me. I updated my question with why I want to remove all text. – Chumbawamba Oct 02 '12 at 12:43
  • I've read your requirement. You should start by studying ISO-32000-1 (take a couple of weeks). Then you should write a PDF syntax parser that creates 2 different PDFs: one containing the text, one containing the images. Then do whatever magic is needed on the PDF with the images. Finally superimpose the PDF with the text on the PDF with the images. If you don't know anything about iText, you'll need a couple of weeks. The result may not be what you expect. Vector images usually take less space than raster images. – Bruno Lowagie Oct 02 '12 at 12:52
  • If it's not that special, why don't you just start coding ;-) – Bruno Lowagie Oct 02 '12 at 13:58
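
The parsing approach described in the comments above can be sketched roughly as follows. This is a hedged Python illustration, not iTextSharp code, and it is a more conservative variant: it drops only the text-showing operators (`Tj`, `TJ`, `'`, `"`) with their operands and keeps everything else, including `BT`/`ET` and font selection. The names `tokenize` and `strip_text_showing` are hypothetical. The sketch assumes the content stream is already decoded, contains no inline images (`BI ... ID ... EI`), and it does not recurse into Form XObjects, which would need the same treatment.

```python
WHITESPACE = b"\x00\t\n\x0c\r "
DELIMITERS = b"()<>[]{}/%"
TEXT_SHOWING = {b"Tj", b"TJ", b"'", b'"'}


def tokenize(data: bytes):
    """Yield the lexical tokens of a decoded content stream."""
    i, n = 0, len(data)
    while i < n:
        c = data[i:i + 1]
        if c in WHITESPACE:
            i += 1
        elif c == b"%":                      # comment: skip to end of line
            while i < n and data[i:i + 1] not in b"\r\n":
                i += 1
        elif c == b"(":                      # literal string: nests, escapes
            start, depth = i, 0
            while i < n:
                ch = data[i:i + 1]
                if ch == b"\\":
                    i += 1                   # skip the escaped byte
                elif ch == b"(":
                    depth += 1
                elif ch == b")":
                    depth -= 1
                    if depth == 0:
                        break
                i += 1
            i += 1
            yield data[start:i]
        elif data[i:i + 2] in (b"<<", b">>"):
            yield data[i:i + 2]
            i += 2
        elif c == b"<":                      # hex string
            end = data.index(b">", i) + 1
            yield data[i:end]
            i = end
        elif c in b"[]{}":
            yield c
            i += 1
        else:                                # name, number, or operator
            start = i
            i += 1
            while i < n and data[i:i + 1] not in WHITESPACE + DELIMITERS:
                i += 1
            yield data[start:i]


def is_operand(tok: bytes) -> bool:
    """Operands are strings, names, numbers, brackets, or keywords."""
    return (tok[:1] in b"(<>[]/+-.0123456789"
            or tok in (b"true", b"false", b"null"))


def strip_text_showing(data: bytes) -> bytes:
    """Return the stream without text-showing operators and their operands."""
    out, operands = [], []
    for tok in tokenize(data):
        if is_operand(tok):
            operands.append(tok)
            continue
        if tok == b"BI":
            raise ValueError("inline images are not handled by this sketch")
        if tok not in TEXT_SHOWING:
            out.append(b" ".join(operands + [tok]))
        operands = []                        # operands belong to this operator
    return b"\n".join(out)
```

For example, `strip_text_showing(b"BT /F1 12 Tf (Hi) Tj ET /Im1 Do")` keeps the `Do` call and the now empty text object but no longer shows `(Hi)`. Because `Tf` is kept, the fonts stay referenced from /Resources, so this alone is unlikely to shrink the file much.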
2

Now that you've updated your question, and revealed the motivation of the intended measure, let me tell you the truth:

  • These measures will in no way reduce the size of PDFs.

  • Instead they'll lead to a hugely increased file size:

    1. First, removing text and fonts may shrink the file slightly, yes.

    2. Then, converting the remains of the page to a bitmap will certainly increase the size hugely (unless you accept very low image quality).

    3. Finally, 'pasting' the text over it again will increase the file size once more (very likely by the same amount you saved in the first step).

It's not a good plan at all.

If you provide (a link to) one of your typical sample PDF files, I can probably come up with a Ghostscript (plus other tools) command line that works out of the box and shrinks the PDF size more efficiently.

Kurt Pfeifle
  • I'm sorry, but I can't share the PDFs, though I can tell you about them. All PDFs are A4 in size. They usually contain a lot of highly detailed vector images, which take up several MB, whilst as a bitmap they can be 100 KB. I did my research on the file sizes and the differences were significant. – Chumbawamba Oct 03 '12 at 13:02
0

To remove all text in a PDF, the easiest solution is to use Ghostscript:

    gs -o output_no_text.pdf -sDEVICE=pdfwrite -dFILTERTEXT input.pdf
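
If the command needs to run from a script, for instance over a folder of files, a small wrapper along these lines might do. It is only a sketch: `filter_text_command` and `remove_text` are hypothetical helper names, and it assumes the executable is called `gs` and is on the PATH (on Windows it is usually named differently) and that the installed Ghostscript release is recent enough to support `-dFILTERTEXT`.

```python
import shutil
import subprocess
from typing import List


def filter_text_command(src: str, dst: str, gs: str = "gs") -> List[str]:
    """Build the same argument list as the command line above."""
    return [gs, "-o", dst, "-sDEVICE=pdfwrite", "-dFILTERTEXT", src]


def remove_text(src: str, dst: str, gs: str = "gs") -> None:
    """Run Ghostscript; raises CalledProcessError on a non-zero exit."""
    if shutil.which(gs) is None:
        raise FileNotFoundError(f"Ghostscript executable {gs!r} not found on PATH")
    # Passing a list (no shell) avoids quoting issues with odd file names.
    subprocess.run(filter_text_command(src, dst, gs), check=True)
```

Calling `remove_text("input.pdf", "output_no_text.pdf")` then corresponds to the command line above.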
user3492925