Images extracted from PDF are horizontally fragmented

Question

I have to extract images from corporate PDF files that contain technical drawings. The PDF files conform to a PDF/A format.

I'm using an approach with Apache's pdfbox, which I learned from this question.

import java.io.IOException;
import java.util.List;
import java.util.Map;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

public class ExtractImages {

    /**
     * Writes every image XObject found in the page resources to disk.
     *
     * @param filename pdf file
     * @param res output path prefix (folder and file name stem) for the extracted images
     * @throws IOException if the PDF cannot be read or an image cannot be written
     */
    @SuppressWarnings("unchecked") // getAllPages() returns a raw List in PDFBox 1.x
    public static void extractImages(String filename, String res) throws IOException {
        PDDocument document = PDDocument.load(filename);
        try {
            int pageNo = 0;
            List<PDPage> pages = document.getDocumentCatalog().getAllPages();
            for (PDPage page : pages) {
                pageNo++;
                PDResources resources = page.getResources();
                Map<String, PDXObjectImage> pageImages = resources.getImages();
                if (pageImages != null) {
                    for (Map.Entry<String, PDXObjectImage> entry : pageImages.entrySet()) {
                        // write2file appends the file extension for the image type itself
                        entry.getValue().write2file(res + "_page_" + pageNo + "_" + entry.getKey());
                    }
                }
            }
        } finally {
            // close the document even if extraction fails part-way
            document.close();
        }
    }
}

My problem is that, for some files, the extracted images are split into up to three horizontal slices. Since I don't want to stitch them together manually, I would be glad of any advice.

EDIT - APPROACH 1

One solution I thought of was to create one folder per image, put all the fragments into their corresponding folders, then iterate over the folders and merge the contents. That would require some sorting work on my side, but I think it could work.

String key = (String) imageIter.next();

returns Im<number>, where the number denotes the order of the images on the page. The fragments in each folder would therefore already be ordered, and the merging program could easily work out which part goes on top, and so on.
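A minimal sketch of the ordering step, assuming the extracted files follow the pattern <pdfname>_page_<n>_Im<m>.<tiff|png> described in this question. The class name FragmentOrder and the method groupByPage are made up for illustration. It sorts by the numeric value of the Im index, because a plain string sort would put Im10 before Im2.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FragmentOrder {

    // Assumed file name layout: <pdfname>_page_<n>_Im<m>.tiff or .png
    private static final Pattern NAME =
            Pattern.compile("(.+)_page_(\\d+)_Im(\\d+)\\.(?:tiff|png)");

    /** Groups file names by page number; each group is sorted by numeric Im index. */
    public static SortedMap<Integer, List<String>> groupByPage(Collection<String> fileNames) {
        SortedMap<Integer, SortedMap<Integer, String>> byPage = new TreeMap<>();
        for (String name : fileNames) {
            Matcher m = NAME.matcher(name);
            if (!m.matches()) {
                continue; // skip files that do not look like extracted fragments
            }
            int page = Integer.parseInt(m.group(2));
            int index = Integer.parseInt(m.group(3));
            byPage.computeIfAbsent(page, p -> new TreeMap<>()).put(index, name);
        }
        SortedMap<Integer, List<String>> result = new TreeMap<>();
        for (Map.Entry<Integer, SortedMap<Integer, String>> e : byPage.entrySet()) {
            result.put(e.getKey(), new ArrayList<>(e.getValue().values()));
        }
        return result;
    }
}
```

This only establishes an order per page. Whether that order really runs from top to bottom on the page is an assumption that would have to be checked against the actual files.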

EDIT - APPROACH 2

Another approach I could think of: the fragments carry their order in their file names, which follow the pattern pdfname_page_\d+_Im\d+\.(tiff|png). So I could sort the images by that order and then merge all consecutive fragments that have the same width. I checked the fragments, and it seems that nearly all images have different dimensions.
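A rough sketch of the merging step, assuming the fragments have already been loaded (for example with ImageIO.read) and sorted from top to bottom. The names FragmentStitcher, stackVertically and mergeRuns are hypothetical. It treats every run of consecutive images with equal width as slices of one picture, so two unrelated neighbouring images that happen to share a width would be merged by mistake.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

public class FragmentStitcher {

    /** Stacks the given images top to bottom; all are assumed to share one width. */
    public static BufferedImage stackVertically(List<BufferedImage> parts) {
        int width = parts.get(0).getWidth();
        int height = 0;
        for (BufferedImage part : parts) {
            height += part.getHeight();
        }
        BufferedImage out = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        int y = 0;
        for (BufferedImage part : parts) {
            g.drawImage(part, 0, y, null); // place each slice below the previous one
            y += part.getHeight();
        }
        g.dispose();
        return out;
    }

    /** Merges each run of consecutive equal-width images into a single image. */
    public static List<BufferedImage> mergeRuns(List<BufferedImage> ordered) {
        List<BufferedImage> result = new ArrayList<>();
        List<BufferedImage> run = new ArrayList<>();
        for (BufferedImage img : ordered) {
            // a change in width ends the current run
            if (!run.isEmpty() && run.get(0).getWidth() != img.getWidth()) {
                result.add(stackVertically(run));
                run = new ArrayList<>();
            }
            run.add(img);
        }
        if (!run.isEmpty()) {
            result.add(stackVertically(run));
        }
        return result;
    }
}
```

The output is always converted to TYPE_INT_RGB here, which is a simplification; bilevel TIFF drawings would grow in size unless the original image type is kept.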

What do you say to these approaches?

EDIT3

Since we ran out of time, my colleague and I had to extract the images by hand. I'm still interested, but I'll have to solve this problem in my free time.

Should I add my own solution suggestions as an answer or a comment? Or is it okay to add them to the question? – mike, Nov 09 '12 at 13:22

score 2 · Answer 1 · answered Nov 08 '12 at 17:43

The extracted images are fragmented into three slices because the images embedded in the PDF are fragmented too. Most likely the PDF-generating software did this automatically. (It is very rare for, say, an InDesign document designer to do this on purpose.)

Hence, there is no reliable method which you could use to automatically stitch together the fragments.

What you can try is this -- but only if you have a version of Adobe Acrobat (Pro?) available:

Use the built-in "PDF Optimizer".
In the "Delete Objects" panel, activate the "Detect image fragments and merge them" option.

(Sorry, I translated the menu and UI entries above from memory of a German Acrobat Pro installation, so they almost certainly don't match the English UI exactly.)

However, this method will, in my experience, not work very reliably. In most cases of image fragmentation in PDFs it will not work at all. :-(

Kurt Pfeifle

As a side note, I have seen this problem with images printed to PDF on Windows. For some unknown reason, the internal printing architecture in Windows splits some images into pieces when they are sent to the printer driver. – yms Nov 08 '12 at 17:57
Does your approach work automatically per pdf document, per page or even per image? – mike Nov 09 '12 at 09:17
@mike: The *'Detect image fragments and merge them'* in Acrobat (Pro?) works per PDF document only (and as I said, doesn't work reliably). – Kurt Pfeifle Nov 09 '12 at 10:22