I have to extract images from corporate PDF files that contain technical drawings. The PDF files conform to a PDF/A format.
I'm using an approach based on Apache PDFBox, which I learned from this question.
import java.io.IOException;
import java.util.List;
import java.util.Map;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

public class ExtractImages {
    /**
     * Extracts all images from a PDF, page by page (PDFBox 1.x API).
     *
     * @param filename pdf file
     * @param res output path prefix for the extracted images
     * @throws IOException if the PDF cannot be read or an image cannot be written
     */
    @SuppressWarnings("unchecked") // getAllPages() returns a raw List in PDFBox 1.x
    public static void extractImages(String filename, String res) throws IOException {
        PDDocument document = PDDocument.load(filename);
        try {
            int pageNo = 0;
            List<PDPage> pages = document.getDocumentCatalog().getAllPages();
            for (PDPage page : pages) {
                pageNo++;
                PDResources resources = page.getResources();
                if (resources == null) {
                    continue; // page has no resources of its own
                }
                Map<String, PDXObjectImage> pageImages = resources.getImages();
                if (pageImages == null) {
                    continue;
                }
                for (Map.Entry<String, PDXObjectImage> entry : pageImages.entrySet()) {
                    // write2file appends the file extension that matches the image type
                    entry.getValue().write2file(res + "_page_" + pageNo + "_" + entry.getKey());
                }
            }
        } finally {
            // always release the document, even if extraction fails
            document.close();
        }
    }
}
My problem now is that for some files the extracted images are split into up to three horizontal slices. Since I don't want to splice them together manually, I would be glad if someone had some advice.
EDIT - APPROACH 1
One solution I thought of was to create one folder per image, put all the fragments into their corresponding folders, then iterate over the folders and merge the contents. That would require some sorting work on my side, but I think it could work.
String key = (String) imageIter.next();
returns Im<number>, where the number denotes the order of the images on the page. The fragments in each folder would therefore already be ordered, and the merging program could easily work out which part goes on top, and so on.
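The merging step of this approach could look roughly like the sketch below. It assumes that the fragments of one image have already been loaded in Im<number> order and that this order runs from top to bottom, which I have not verified for every file. FragmentStacker and stackVertically are names made up for illustration; only standard java.awt classes are used.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.util.List;

public class FragmentStacker {
    /**
     * Stacks the fragments on top of each other, first list element at the top.
     * Hypothetical helper; assumes the list order matches the visual order.
     */
    public static BufferedImage stackVertically(List<BufferedImage> fragments) {
        if (fragments.isEmpty()) {
            throw new IllegalArgumentException("no fragments to merge");
        }
        // The merged image is as wide as the widest fragment
        // and as tall as all fragments together.
        int width = 0;
        int height = 0;
        for (BufferedImage fragment : fragments) {
            width = Math.max(width, fragment.getWidth());
            height += fragment.getHeight();
        }
        BufferedImage merged = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = merged.createGraphics();
        try {
            int y = 0;
            for (BufferedImage fragment : fragments) {
                g.drawImage(fragment, 0, y, null); // draw at the current vertical offset
                y += fragment.getHeight();
            }
        } finally {
            g.dispose();
        }
        return merged;
    }
}
```

The fragments could be read with javax.imageio.ImageIO.read and the result written with ImageIO.write. If the slices turn out to be stored bottom to top, reversing the list before the call should be enough.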
EDIT - APPROACH 2
Another approach I could think of: the fragments carry their order in their file names, which follow the pattern pdfname_page_\d+_Im\d+\.(tiff|png). So I could sort the images by that order and then merge all consecutive fragments that have the same width.
I checked the fragments, and it seems that nearly all images have different dimensions.
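A minimal sketch of the sorting and grouping part of this approach, assuming the file naming scheme described above and that the widths have already been read from the files. FragmentGrouper, Fragment and group are hypothetical names, and the same-width rule is only a heuristic: two unrelated neighbouring images that happen to share a width would be merged by mistake.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FragmentGrouper {
    // Assumed naming scheme: <pdfname>_page_<page>_Im<number>.<tiff|png>
    private static final Pattern NAME =
            Pattern.compile(".*_page_(\\d+)_Im(\\d+)\\.(?:tiff|png)");

    /** One extracted file: its name, pixel width, and the numbers parsed from the name. */
    public static final class Fragment {
        public final String fileName;
        public final int width;
        public final int page;
        public final int index;

        public Fragment(String fileName, int width) {
            Matcher m = NAME.matcher(fileName);
            if (!m.matches()) {
                throw new IllegalArgumentException("unexpected file name: " + fileName);
            }
            this.fileName = fileName;
            this.width = width;
            this.page = Integer.parseInt(m.group(1));
            this.index = Integer.parseInt(m.group(2));
        }
    }

    /** Sorts by page and image number, then groups consecutive fragments of equal width. */
    public static List<List<Fragment>> group(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        // Compare numerically so that Im10 sorts after Im2.
        sorted.sort(Comparator.comparingInt((Fragment f) -> f.page)
                .thenComparingInt(f -> f.index));
        List<List<Fragment>> groups = new ArrayList<>();
        Fragment previous = null;
        for (Fragment f : sorted) {
            boolean sameRun = previous != null
                    && previous.page == f.page
                    && previous.width == f.width;
            if (!sameRun) {
                groups.add(new ArrayList<>()); // start a new candidate image
            }
            groups.get(groups.size() - 1).add(f);
            previous = f;
        }
        return groups;
    }
}
```

Each resulting group with more than one entry would be a candidate for merging; groups of size one are images that were extracted whole.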
What do you think of these approaches?
EDIT 3
Since we ran out of time, my colleague and I had to extract the images by hand. I'm still interested, but I'll have to solve this problem in my free time.