I have a PDF file produced by an OCR processor. The processor recognizes the text in the image and adds it to the PDF, but then embeds a low-quality image in place of the original one (I have no idea why anyone would do that, but they do).

I would like to take this PDF, remove the image stream, and leave the text untouched, so that I can import it (using iText's page importing feature) into a PDF I'm creating myself with the real image.

And before someone asks: I have already tried another tool (JPedal) to extract the text coordinates, but when I draw the text on my PDF it isn't in the same position as in the original.

I'd rather do this in Java, but if another tool can do it better, just let me know. Removing only the images would be enough; I can live with a PDF that still contains the drawings.

Maurício Linhares

2 Answers

I used Apache PDFBox in a similar situation.

To be a little more specific, try something like this:

import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.exceptions.CryptographyException;
import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import java.io.IOException;

public class Main {
    public static void main(String[] argv) throws COSVisitorException, InvalidPasswordException, CryptographyException, IOException {
        // Written against the PDFBox 1.x API (PDDocument.load(String), getAllPages()).
        PDDocument document = PDDocument.load("input.pdf");
        try {
            if (document.isEncrypted()) {
                // Try to open the document with an empty password.
                document.decrypt("");
            }

            PDDocumentCatalog catalog = document.getDocumentCatalog();
            for (Object pageObj : catalog.getAllPages()) {
                PDPage page = (PDPage) pageObj;
                // Clear the image XObjects registered in this page's resources.
                PDResources resources = page.findResources();
                resources.getImages().clear();
            }

            document.save("strippedOfImages.pdf");
        } finally {
            // Release the resources held by the document.
            document.close();
        }
    }
}

It is supposed to remove images of every type (PNG, JPEG, ...). It should work as described here:

Sample article.

Glorfindel
IceGlow
  • Hi @IceGlow, as I have explained before, I can extract the text using JPedal, but that's not what I'm looking for, I want to remove the image streams from the PDF document itself. Imagine it's like trying to remove all tags from an HTML document, it's just that this is rather complicated to do with PDF files. But thanks anyway for the answer. – Maurício Linhares Aug 01 '11 at 03:01
  • And it did it! Thank you very much @IceGlow! – Maurício Linhares Aug 03 '11 at 02:04
  • I tried the exact same thing, but when it saves the PDF, all Images are intact. I can see that the resource object does not have any images after clear() operation. Help please? – Pushkar Mar 08 '13 at 16:54
  • @MaurícioLinhares - Any inputs on this? – Pushkar Mar 08 '13 at 17:02
  • This solution only works for simple PDFs: it removes the image XObjects directly associated with the page, but image XObjects can also be associated with referenced form XObjects or with patterns, and images may even be inlined. Furthermore, strictly speaking, removing image XObject resources without also removing the associated operations in the page content stream makes the file non-compliant with the PDF specification. – mkl Mar 01 '14 at 08:55
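
To illustrate the point about inlined images: in a content stream an inline image is not invoked through a resource name at all, but written in place as a `BI ... ID ... EI` sequence, so clearing the page resources cannot touch it. The following is a minimal sketch of filtering such sequences out. It assumes the content stream has already been split into plain string tokens; `InlineImageFilter` and `stripInlineImages` are hypothetical names, not PDFBox classes, and a real parser may represent the binary image data differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a PDFBox class: a content stream is modelled
// as a list of plain string tokens (an assumption made for illustration).
final class InlineImageFilter {

    // Returns a copy of the tokens with every BI ... ID ... EI sequence removed.
    static List<String> stripInlineImages(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        boolean insideInlineImage = false;
        for (String token : tokens) {
            if (insideInlineImage) {
                // Skip everything up to and including the closing EI operator.
                if (token.equals("EI")) {
                    insideInlineImage = false;
                }
                continue;
            }
            if (token.equals("BI")) {
                // BI opens an inline image; drop it and what follows.
                insideInlineImage = true;
                continue;
            }
            kept.add(token);
        }
        return kept;
    }
}
```

Text-showing operators such as `Tj` outside the `BI ... EI` span are left untouched, which is the property the question asks for.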

You need to parse the document as follows:

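// Imports for the PDFBox 1.x API used below; both methods belong inside a class of your choice.
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFOperator;

public static void strip(String pdfFile, String pdfFileOut) throws Exception {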

    PDDocument doc = PDDocument.load(pdfFile);

    List pages = doc.getDocumentCatalog().getAllPages();
    for( int i=0; i<pages.size(); i++ ) {
        PDPage page = (PDPage)pages.get( i );

        // added
        COSDictionary newDictionary = new COSDictionary(page.getCOSDictionary());

        PDFStreamParser parser = new PDFStreamParser(page.getContents());
        parser.parse();
        List tokens = parser.getTokens();
        List newTokens = new ArrayList();
        for(int j=0; j<tokens.size(); j++) {
            Object token = tokens.get( j );

            if( token instanceof PDFOperator ) {
                PDFOperator op = (PDFOperator)token;
                if( op.getOperation().equals( "Do") ) {
                    //remove the one argument to this operator
                    // added
                    COSName name = (COSName)newTokens.remove( newTokens.size() -1 );
                    // added
                    deleteObject(newDictionary, name);
                    continue;
                }
            }
            newTokens.add( token );
        }
        PDStream newContents = new PDStream( doc );
        ContentStreamWriter writer = new ContentStreamWriter( newContents.createOutputStream() );
        writer.writeTokens( newTokens );
        newContents.addCompression();

        page.setContents( newContents );

        // added
        PDResources newResources = new PDResources(newDictionary);
        page.setResources(newResources);
    }

    doc.save(pdfFileOut);
    doc.close();
}


// added
public static boolean deleteObject(COSDictionary d, COSName name) {
    for(COSName key : d.keySet()) {
        if( name.equals(key) ) {
            d.removeItem(key);
            return true;
        }
        COSBase object = d.getDictionaryObject(key); 
        if(object instanceof COSDictionary) {
            if( deleteObject((COSDictionary)object, name) ) {
                return true;
            }
        }
    }
    return false;
}
bora.oren
paf.goncalves
  • I tried to use your function after running into the same problem as @Pushkar. I am not familiar with java and mainly interested in just getting rid of the images. Would you mind expanding your script to a usable file (especially containing all necessary imports)? – Tim Jul 12 '13 at 09:23
  • This works! Loading and changing pages is much faster now. The filesize has not decreased though (76MB with images, 78MB without images). Is there a way to get rid of the images themselves, so that the files become smaller again? – Tim Jul 12 '13 at 12:25
  • This is a better solution than the accepted one, but it does also delete XObject forms, which are also invoked by "Do". – Tilman Hausherr Aug 06 '16 at 11:51
  • Where can I find list of all operators with description? – Adesh Atole Aug 06 '18 at 14:22
  • You can check here: http://www.verypdf.com/document/pdf-format-reference/pg_0985.htm (that's just the first page) – paf.goncalves Aug 06 '18 at 14:39
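
Regarding the comment that "Do" also invokes form XObjects: an XObject declares its kind through its `/Subtype` entry (`/Image` or `/Form`), so one possible refinement is to drop a `Do` call only when its operand names an image. Below is a minimal sketch of that check. It assumes the content stream is already tokenized into strings and that the subtype of each XObject name has been collected into a map beforehand; `ImageDoFilter`, `stripImageInvocations`, and the map are hypothetical and not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a PDFBox class: tokens are plain strings and
// the subtype map stands in for a lookup in the page's XObject resources.
final class ImageDoFilter {

    // Returns a copy of the tokens without the "/Name Do" pairs whose
    // XObject subtype is "Image"; form XObject invocations are kept.
    static List<String> stripImageInvocations(List<String> tokens, Map<String, String> subtypes) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (token.equals("Do") && !kept.isEmpty()) {
                String operand = kept.get(kept.size() - 1);
                // The operand looks like "/Im0"; the map is keyed without the slash.
                String key = operand.startsWith("/") ? operand.substring(1) : operand;
                if ("Image".equals(subtypes.get(key))) {
                    kept.remove(kept.size() - 1); // drop the operand
                    continue;                     // drop the Do operator itself
                }
            }
            kept.add(token);
        }
        return kept;
    }
}
```

With the tokens `q /Im0 Do Q /Fm1 Do` and a map that marks `Im0` as `Image` and `Fm1` as `Form`, only the `/Im0 Do` pair is removed. Images nested inside a form XObject would still need the same treatment applied to that form's own content stream.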