Images inverted and split when extracting images from a PDF document using PDFBox or Poppler

Question

I want to extract the whole image on each page of a PDF document using PDFBox in Java, but all the extracted images are inverted and split into pieces. Note that this is not a bug in PDFBox or Poppler; it is caused by the way the PDF document itself is built. How can I piece the fragments together into whole images and get each image in the correct orientation? Could anybody give me some advice? A snippet of Java code would be preferred. My PDF link: download

Wow, that is weird. The only easy solution coming to my mind would be to simply render the whole page as an image, without the main text if you wish. You can use the `PDFRenderer` class for that, probably customized (see the `CustomPageDrawer` example for inspiration). — mkl, May 25 '23 at 17:22
@mkl, but the image on each page is all I want; the text is useless to me. I cannot figure out why the images are split or how to piece each "broken" image back together. — biotech7, May 25 '23 at 23:32
The following is my Java code for extracting images with the PDFBox library. download [link] https://banyafx.oss-cn-hangzhou.aliyuncs.com/assets/pdf/ExtractImagesInPdf.java — biotech7, May 25 '23 at 23:47
biotech7 - First of all, do you merely need to export the figures from that one PDF (or at most a very few)? Then the manual way @KJ proposes is the best option. Otherwise, are all the PDFs built alike, with the same internal structure? Then I'd have an idea for how to programmatically filter the main text from the PDF and organize the figures on a one-per-page basis. Failing that, you could apply image analysis to the rendering of the full page to determine the area of each figure and export those areas. — mkl, May 26 '23 at 08:22
@mkl, I have quite a few PDF documents like this to process, so handling them programmatically is the right fit for collecting these images. All the PDFs have the same internal structure. I visited your GitHub PDFBox project (https://github.com/mkl-public/testarea-pdfbox1) and studied ImageLocator.java (https://github.com/mkl-public/testarea-pdfbox1/blob/master/src/main/java/mkl/testarea/pdfbox1/content/ImageLocator.java), but it is not suitable for my PDF documents. Could you give me snippets of Java code to handle my PDF document? — biotech7, May 27 '23 at 10:54
I'll take a look. I might not find the time until after Whitmonday, though. — mkl, May 28 '23 at 09:59

K J · Answer 1 · 2023-05-26T02:25:09.070

Here are the first 6 images, and we can see they are simply the text on the right, whereas the artwork is specified as single vector line paths (as shown on the left).

Extracting hundreds or thousands of such images is more work than it's worth.
Page 1 alone has 115, at an unusually high density of 1200 ppi:

C:\Apps\PDF\poppler\poppler-23.05.0\Library\bin>pdfimages -list -f 1 -l 1 my.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 stencil   144   468  -       1   1  ccitt  no       348  0  1200  1200  197B 2.3%
   1     1 stencil    64   456  -       1   1  image  no       349  0  1200  1200  165B 4.5%
   1     2 stencil    64   456  -       1   1  image  no       349  0  1200  1200  165B 4.5%
   1     3 stencil    72   468  -       1   1  ccitt  no       350  0  1200  1200  154B 3.7%
   1     4 stencil   192   468  -       1   1  ccitt  no       351  0  1200  1200  264B 2.4%
   1     5 stencil    96   456  -       1   1  ccitt  no       352  0  1200  1200  142B 2.6%
   1     6 stencil   136   570  -       1   1  ccitt  no       353  0  1200  1200  192B 2.0%
   1     7 stencil   224   582  -       1   1  ccitt  no       419  0  1200  1200  329B 2.0%
   1     8 stencil   104   582  -       1   1  ccitt  no       420  0  1200  1200  194B 2.6%
   1     9 stencil   192   582  -       1   1  ccitt  no       345  0  1200  1200  306B 2.2%

So export each marquee area as an image.

It is possible to define each area programmatically as a vector rectangle, but as fast as you can see them (about 4 xy rect values each) you could click to copy the area to the clipboard and automate saving it as image6.png, 7.png, 8.png, etc.
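If the rectangles are known, the cropping itself can be automated. The following is a minimal sketch using only the Java standard library; it assumes the page has already been rendered to a bitmap and that each area is given in pixel coordinates with the origin at the top left. The class name `FigureCropper` and its methods are hypothetical names chosen for illustration.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import javax.imageio.ImageIO;

// Hypothetical helper: cuts known rectangles out of an already rendered page.
class FigureCropper {

    /** Returns one sub-image per rectangle (pixel coordinates, origin top-left). */
    static List<BufferedImage> crop(BufferedImage page, List<Rectangle> areas) {
        Rectangle bounds = new Rectangle(0, 0, page.getWidth(), page.getHeight());
        List<BufferedImage> figures = new ArrayList<>();
        for (Rectangle area : areas) {
            // Clip to the page so a slightly oversized rectangle does not throw.
            Rectangle r = area.intersection(bounds);
            if (!r.isEmpty()) {
                figures.add(page.getSubimage(r.x, r.y, r.width, r.height));
            }
        }
        return figures;
    }

    /** Saves the crops as image6.png, image7.png, ... starting at firstNumber. */
    static void save(List<BufferedImage> figures, File folder, int firstNumber)
            throws IOException {
        for (int i = 0; i < figures.size(); i++) {
            File target = new File(folder, "image" + (firstNumber + i) + ".png");
            ImageIO.write(figures.get(i), "png", target);
        }
    }
}
```

Rectangles that lie completely outside the page are skipped, so the numbering of the saved files follows the crops that were actually produced.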

Some people attempt to define how white space marks out a capturable area, but that depends on whether you have the time to write a custom detector, for example one that searches for "6. blah" or "7. blah" (not 1. to 5.) and then takes the full page width for a given height under that. Here using Poppler:

pdftoppm -f 1 -l 1 -r 300 -x 360 -W 1750 -y 375 -H 360 -png my.pdf out6

Now that we have the measure of it, we can apply the Y distance between 6. and 7. to reach the following figures.
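As a rough sketch of that idea, the crop for each following figure can be derived from the figure 6 values by adding a fixed vertical step. The x, y, width and height below are the ones from the `pdftoppm` call above; the class name `CropCommands` and the `yStep` parameter are assumptions for illustration, and the actual step has to be measured on the page.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds pdftoppm argument lists for stacked figures.
class CropCommands {
    // Crop values for figure 6, taken from the pdftoppm call above (pixels at 300 dpi).
    static final int X = 360;
    static final int Y_FIRST = 375;
    static final int WIDTH = 1750;
    static final int HEIGHT = 360;

    /**
     * Arguments for the figure that sits `index` steps below the first one.
     * yStep is the measured pixel distance between two consecutive figures.
     */
    static List<String> command(String pdf, int page, int index, int yStep, String outPrefix) {
        int y = Y_FIRST + index * yStep;
        return Arrays.asList("pdftoppm",
                "-f", String.valueOf(page), "-l", String.valueOf(page),
                "-r", "300",
                "-x", String.valueOf(X), "-W", String.valueOf(WIDTH),
                "-y", String.valueOf(y), "-H", String.valueOf(HEIGHT),
                "-png", pdf, outPrefix);
    }
}
```

The returned list can be handed to `new ProcessBuilder(...)` to run Poppler, provided `pdftoppm` is on the PATH. Figures of different heights would need a per-figure height instead of the single constant used here.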

@K J. Searching for 'blah' is a nice way to calculate each image's coordinates. So how can I search for the string 'blah' with PDFBox or Poppler? Is it a key-value pair in a map or a tag in a schema-indexed XML string? — biotech7, May 27 '23 at 10:58

score 0 · Accepted Answer · answered May 30 '23 at 16:04

At first glance it looked like each of the figures in question was drawn in a separate block of content stream instructions enveloped by but not containing text objects. Thus, one approach to isolate them is to export each such block of instructions to a separate new page. You can then post-process these new pages, e.g. by rendering them as bitmap images using a `PDFRenderer`.

I based code doing this on the PdfContentStreamEditor originally from this answer like this:

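// Assumes PDFBox 2.x. PdfContentStreamEditor is the helper class from the linked answer, not part of PDFBox.
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;
import java.util.List;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;

PDDocument document = PDDocument.load(...);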

for (PDPage page : document.getDocumentCatalog().getPages()) {
    PdfContentStreamEditor editor = new PdfContentStreamEditor(document, page) {
        ByteArrayOutputStream commonRaw = null;
        ContentStreamWriter commonWriter = null;
        int depth = 0;

        @Override
        public void processPage(PDPage page) throws IOException {
            commonRaw = new ByteArrayOutputStream();
            try {
                commonWriter = new ContentStreamWriter(commonRaw);
                startFigurePage(page);
                super.processPage(page);
            } finally {
                endFigurePage();
                commonRaw.close();
            }
        }

        @Override
        protected void write(ContentStreamWriter contentStreamWriter, Operator operator,
                List<COSBase> operands) throws IOException {
            String operatorString = operator.getName();
            if (operatorString.equals("BT")) {
                endFigurePage();
            }
            if (operatorString.equals("q")) {
                depth++;
            }
            writeFigure(operator, operands);
            if (operatorString.equals("Q")) {
                depth--;
            }
            if (operatorString.equals("ET")) {
                startFigurePage(getCurrentPage());
            }

            super.write(contentStreamWriter, operator, operands);
        }

        OutputStream figureRaw = null;
        ContentStreamWriter figureWriter = null;
        PDPage figurePage = null;
        int xobjectsDrawn = 0;
        int pathsPainted = 0;

        void startFigurePage(PDPage currentPage) throws IOException {
            figurePage = new PDPage(currentPage.getMediaBox());
            figurePage.setResources(currentPage.getResources());
            PDStream stream = new PDStream(document);
            figurePage.setContents(stream);
            figureWriter = new ContentStreamWriter(figureRaw = stream.createOutputStream(COSName.FLATE_DECODE));
            figureRaw.write(commonRaw.toByteArray());
            xobjectsDrawn = 0;
            pathsPainted = 0;
        }

        void endFigurePage() throws IOException {
            if (figureWriter != null) {
                figureWriter = null;
                figureRaw.close();
                figureRaw = null;
                if (xobjectsDrawn > 0 || pathsPainted > 3)
                    document.addPage(figurePage);
                figurePage = null;
            }
        }

        final List<String> PATH_PAINTING_OPERATORS = Arrays.asList("S", "s", "F", "f", "f*",
                "B", "B*", "b", "b*");

        void writeFigure(Operator operator, List<COSBase> operands) throws IOException {
            if (figureWriter != null) {
                String operatorString = operator.getName();
                boolean isXObjectDo = operatorString.equals("Do");
                boolean isPathPainting = PATH_PAINTING_OPERATORS.contains(operatorString);
                if (isXObjectDo)
                    xobjectsDrawn++;
                if (isPathPainting)
                    pathsPainted++;
                figureWriter.writeTokens(operands);
                figureWriter.writeToken(operator);
                if (depth == 0) {
                    if (!isXObjectDo) {
                        if (isPathPainting)
                            operator = Operator.getOperator("n");
                        commonWriter.writeTokens(operands);
                        commonWriter.writeToken(operator);
                    }
                }
            }
        }
    };
    editor.processPage(page);
}

document.save(new File(RESULT_FOLDER, "my-isolatedFigures.pdf"));

(IsolateFigures test testIsolateInMy)
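For the post-processing step, the isolated pages can be rendered to bitmaps. The following is a minimal sketch, assuming PDFBox is on the classpath; the class name `FigurePageRenderer` and the `firstPage` parameter are hypothetical, the latter added so that pages before the appended figure pages can be skipped.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;

// Hypothetical helper: renders pages of an already opened document to RGB bitmaps.
class FigurePageRenderer {

    /** Renders every page from firstPage (0-based) to the end at the given resolution. */
    static List<BufferedImage> renderFrom(PDDocument document, int firstPage, float dpi)
            throws IOException {
        PDFRenderer renderer = new PDFRenderer(document);
        List<BufferedImage> images = new ArrayList<>();
        for (int i = firstPage; i < document.getNumberOfPages(); i++) {
            // ImageType.RGB gives an opaque image on a white background.
            images.add(renderer.renderImageWithDPI(i, dpi, ImageType.RGB));
        }
        return images;
    }
}
```

Each returned image can then be written with `ImageIO.write(image, "png", file)`. Since the code above adds the figure pages to the same document, passing the original page count as `firstPage` is one way to render only the figures.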

The first figures are extracted quite fine:

S30 a	S30 b	S31 a	S31 b

Certain figures, though, turn out to contain text objects and, therefore, are split into partial images and lose their text content:

S32 b 1	S32 b 2	S32 b 3	S32 b 4
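One possible way to reassemble such partial images: every figure page is created with the media box of its source page, so renderings of the parts at the same resolution have identical dimensions and can simply be overlaid. The sketch below uses only the Java standard library and assumes the parts were rendered opaque on a white background; the class name `PartialFigureMerger` is hypothetical, and deciding which pages belong to the same figure is left to the caller. The text dropped during isolation is not recovered by this.

```java
import java.awt.image.BufferedImage;
import java.util.List;

// Hypothetical helper: overlays equally sized renderings of one figure's parts.
class PartialFigureMerger {

    /** Keeps the darker value per colour channel, so ink from every part survives. */
    static BufferedImage merge(List<BufferedImage> parts) {
        int width = parts.get(0).getWidth();
        int height = parts.get(0).getHeight();
        BufferedImage result = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int r = 255, g = 255, b = 255; // start from white paper
                for (BufferedImage part : parts) {
                    int rgb = part.getRGB(x, y);
                    r = Math.min(r, (rgb >> 16) & 0xFF);
                    g = Math.min(g, (rgb >> 8) & 0xFF);
                    b = Math.min(b, rgb & 0xFF);
                }
                result.setRGB(x, y, (r << 16) | (g << 8) | b);
            }
        }
        return result;
    }
}
```

Taking the per-channel minimum behaves like a "darken" blend: white areas of one part never hide the drawing in another.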

Well, of course you can do text extraction with coordinates and then filter everything with y coordinates outside that range. — mkl, May 30 '23 at 17:13
@mkl, nice work! I'll use this code in my project for extracting figures from PDF documents and may make small amendments to aggregate the partial, separated images. Thanks again for your helpful strategy and elegant code. — biotech7, May 31 '23 at 12:53
