Extraction of images present inside a paragraph

Question

I am building an application in which I need to parse a system-generated PDF and use the parsed information to populate my application's database columns. Unfortunately, the PDF structure I am dealing with has a column called comments which contains both text and images. I found a way of reading the images and the text separately from the PDF, but my ultimate aim is to add a placeholder, something like {2}, in place of each image inside the parsed content. Whenever my parser (the application code) parses this line, the system will render the appropriate image in that area; the image is also stored in a separate table inside my application. Please help me resolve this problem.

Thanks in advance.

As you don't show your code, it is difficult to say what you need to change. Essentially use a customized text extraction strategy to insert a "[2]" text chunk at the coordinates of the image. — mkl, Jan 20 '15 at 10:02
@mkl Sorry about the code, we haven't started with the implementation yet; we are still analyzing whether this can be done with iText. As you suggested, I went through the text extraction strategy. My need is like this: the comments section will look like "the graphical area is covered with 325 kms <> ....." where <> stands for an image in the PDF. With this text extraction strategy, will it be possible for me to get "the graphical area is covered with 325 kms {2}....." where 2 points to a unique location where my image is stored (simply a database or a file system)? — Karthik, Jan 21 '15 at 08:58
It sounds like something that is possible with some extra programming (writing a subclass of the rendering interfaces). — Bruno Lowagie, Jan 22 '15 at 09:12
@BrunoLowagie I think so, too. One only has to take care to properly associate the image with the base line of the surrounding text (if the image is drawn in line). If you have text over image over text, though, it should be really easy. — mkl, Jan 22 '15 at 09:22
Yes, deciding which coordinate to take into account when inserting the (X) will be a design decision. One could use the bottom Y coordinate, the top Y coordinate, something in the middle... That's up to the person implementing the application, based on the nature of the images. — Bruno Lowagie, Jan 22 '15 at 09:30

mkl · Accepted Answer · 2015-01-22T16:25:52.247

As already mentioned in comments, a solution would be to essentially use a customized text extraction strategy to insert a "[2]" text chunk at the coordinates of the image.

Code

You can e.g. extend the LocationTextExtractionStrategy like this:

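// Imports assume iText 5.x, where the parser classes live in com.itextpdf.text.pdf.parser
import java.io.File;
import java.io.IOException;
import java.lang.reflect.Field;
import java.nio.file.Files;
import java.util.List;

import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.PdfImageObject;
import com.itextpdf.text.pdf.parser.Vector;

class SimpleMixedExtractionStrategy extends LocationTextExtractionStrategy
{
    SimpleMixedExtractionStrategy(File outputPath, String name)
    {
        this.outputPath = outputPath;
        this.name = name;
    }

    @Override
    @SuppressWarnings("unchecked")
    public void renderImage(final ImageRenderInfo renderInfo)
    {
        try
        {
            PdfImageObject image = renderInfo.getImage();
            if (image == null) return;

            // Store the image as <name>-<counter>.<type> in the output folder
            int number = counter++;
            final String filename = String.format("%s-%s.%s", name, number, image.getFileType());
            Files.write(new File(outputPath, filename).toPath(), image.getImageAsBytes());

            // The image CTM maps the unit square onto the image area,
            // so the transformed unit line is the bottom edge of the image
            LineSegment segment = UNIT_LINE.transformBy(renderInfo.getImageCTM());
            TextChunk location = new TextChunk("[" + filename + "]", segment.getStartPoint(), segment.getEndPoint(), 0f);

            // locationalResult is private in the base class, so access it via reflection
            Field field = LocationTextExtractionStrategy.class.getDeclaredField("locationalResult");
            field.setAccessible(true);
            List<TextChunk> locationalResult = (List<TextChunk>) field.get(this);
            locationalResult.add(location);
        }
        catch (IOException | NoSuchFieldException | SecurityException | IllegalArgumentException | IllegalAccessException e)
        {
            e.printStackTrace();
        }
    }

    final File outputPath;
    final String name;
    int counter = 0;

    // The line from (0,0) to (1,0) in homogeneous coordinates
    final static LineSegment UNIT_LINE = new LineSegment(new Vector(0, 0, 1) , new Vector(1, 0, 1));
}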

(Unfortunately for this kind of work, some members of LocationTextExtractionStrategy are private. Thus, I used some Java reflection. Alternatively you can copy the whole class and change your copy accordingly.)
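
To illustrate the reflection workaround in isolation, here is a minimal sketch without iText. BaseCollector, ExtendedCollector and ReflectionDemo are hypothetical names made up for this example; BaseCollector merely stands in for a library class that keeps its list private, the way LocationTextExtractionStrategy does with locationalResult.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a library class with private state (not an iText class).
class BaseCollector
{
    private final List<String> chunks = new ArrayList<>();

    void collect(String text)
    {
        chunks.add(text);
    }

    String joined()
    {
        return String.join(" ", chunks);
    }
}

// The subclass cannot see 'chunks' directly, so it looks the field up on the base class.
class ExtendedCollector extends BaseCollector
{
    @SuppressWarnings("unchecked")
    void addMarker(String marker)
    {
        try
        {
            Field field = BaseCollector.class.getDeclaredField("chunks");
            field.setAccessible(true); // lift the 'private' restriction for this Field object
            ((List<String>) field.get(this)).add(marker);
        }
        catch (ReflectiveOperationException e)
        {
            // Thrown if the field is renamed or removed in a later library version
            throw new IllegalStateException("Private field not accessible", e);
        }
    }
}

class ReflectionDemo
{
    public static void main(String[] args)
    {
        ExtendedCollector collector = new ExtendedCollector();
        collector.collect("text before");
        collector.addMarker("[image-0.png]");
        collector.collect("text after");
        System.out.println(collector.joined()); // prints "text before [image-0.png] text after"
    }
}
```

The drawback is the same as in the strategy above: the field name is only a string, so a library update that renames the field breaks the code at run time rather than at compile time. Copying the class avoids that at the cost of maintaining the copy.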

Example

Using that strategy you can extract mixed contents like this:

@Test
public void testSimpleMixedExtraction() throws IOException
{
    // try-with-resources closes the stream even if parsing fails
    try (InputStream resourceStream = getClass().getResourceAsStream("book-of-vaadin-page14.pdf"))
    {
        PdfReader reader = new PdfReader(resourceStream);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        SimpleMixedExtractionStrategy listener = new SimpleMixedExtractionStrategy(OUTPUT_PATH, "book-of-vaadin-page14");
        // Process page 1; images are written to OUTPUT_PATH as a side effect
        parser.processContent(1, listener);
        Files.write(new File(OUTPUT_PATH, "book-of-vaadin-page14.txt").toPath(), listener.getResultantText().getBytes());
    }
}

E.g. for my test file (which contains page 14 of the Book of Vaadin) you get this text:

Getting Started with Vaadin
• A version of Book of Vaadin that you can browse in the Eclipse Help system.
You can install the plugin as follows:
1. Start Eclipse.
2. Select Help   Software Updates....
3. Select the Available Software tab.
4. Add the Vaadin plugin update site by clicking Add Site....
[book-of-vaadin-page14-0.png]
Enter the URL of the Vaadin Update Site: http://vaadin.com/eclipse and click OK. The
Vaadin site should now appear in the Software Updates window.
5. Select all the Vaadin plugins in the tree.
[book-of-vaadin-page14-1.png]
Finally, click Install.
Detailed and up-to-date installation instructions for the Eclipse plugin can be found at http://vaad-
in.com/eclipse.
Updating the Vaadin Plugin
If you have automatic updates enabled in Eclipse (see Window   Preferences   Install/Update
  Automatic Updates), the Vaadin plugin will be updated automatically along with other plugins.
Otherwise, you can update the Vaadin plugin (there are actually multiple plugins) manually as
follows:
1. Select Help   Software Updates..., the Software Updates and Add-ons window will
open.
2. Select the Installed Software tab.
14 Vaadin Plugin for Eclipse

and two images, book-of-vaadin-page14-0.png and book-of-vaadin-page14-1.png, in OUTPUT_PATH.
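
On the application side, the markers can later be located in the extracted text again. The following is only a sketch of one way to do that, assuming the "[filename]" marker format produced above; MarkerScanner and findImageNames are hypothetical names, and the pattern would also match ordinary bracketed text that happens to look like a file name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds markers such as [book-of-vaadin-page14-0.png] in extracted text.
final class MarkerScanner
{
    // '[', then a name without ']' or whitespace that ends in '.<extension>', then ']'
    private static final Pattern MARKER = Pattern.compile("\\[([^\\]\\s]+\\.[A-Za-z0-9]+)\\]");

    /** Returns the image file names in the order in which their markers occur. */
    static List<String> findImageNames(String text)
    {
        List<String> names = new ArrayList<>();
        Matcher matcher = MARKER.matcher(text);
        while (matcher.find())
            names.add(matcher.group(1));
        return names;
    }

    public static void main(String[] args)
    {
        String text = "4. Add the Vaadin plugin update site by clicking Add Site....\n"
                + "[book-of-vaadin-page14-0.png]\n"
                + "Enter the URL of the Vaadin Update Site";
        System.out.println(findImageNames(text)); // prints [book-of-vaadin-page14-0.png]
    }
}
```

The returned names can then serve as keys for looking up the stored image, e.g. in a database table or on the file system.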

Improvements to make

As also already mentioned in comments, this solution is for the easy situation in which the image has text above and/or below but neither left nor right.

If there is text to the left and/or right, too, there is a problem: the code above calculates LineSegment segment as the bottom line of the image, but the text strategy usually works with the base line of the text, which is above the bottom line.

In this case, however, one first has to decide at which position on which line the marker should appear in the text. Having decided that, one can adapt the source above.
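
To make that decision concrete, here is a small sketch in plain Java (no iText) of how the marker line moves when a different height inside the image is chosen. MarkerPlacement, markerLine and the relativeHeight parameter are hypothetical names introduced for this illustration, and the matrix values in main are made-up example numbers.

```java
import java.util.Arrays;

// Hypothetical helper, not part of iText: maps points of the unit square through an image CTM.
final class MarkerPlacement
{
    /**
     * ctm = {a, b, c, d, e, f}, the six entries of a PDF transformation matrix.
     * A point (x, y) is mapped to (a*x + c*y + e, b*x + d*y + f).
     */
    static double[] transform(double[] ctm, double x, double y)
    {
        return new double[] { ctm[0] * x + ctm[2] * y + ctm[4], ctm[1] * x + ctm[3] * y + ctm[5] };
    }

    /**
     * Returns {startX, startY, endX, endY} of the line across the image at the
     * given relative height: 0 = bottom edge, 0.5 = middle, 1 = top edge.
     */
    static double[] markerLine(double[] ctm, double relativeHeight)
    {
        double[] start = transform(ctm, 0, relativeHeight);
        double[] end = transform(ctm, 1, relativeHeight);
        return new double[] { start[0], start[1], end[0], end[1] };
    }

    public static void main(String[] args)
    {
        // Assumed example: a 200 x 100 image whose lower left corner is at (50, 300)
        double[] ctm = { 200, 0, 0, 100, 50, 300 };
        System.out.println(Arrays.toString(markerLine(ctm, 0.0))); // prints [50.0, 300.0, 250.0, 300.0]
        System.out.println(Arrays.toString(markerLine(ctm, 0.5))); // prints [50.0, 350.0, 250.0, 350.0]
    }
}
```

In the strategy above, the corresponding change would presumably be to transform a line like new LineSegment(new Vector(0, h, 1), new Vector(1, h, 1)) instead of UNIT_LINE, with h chosen so that the marker lands on the base line of the intended text line. Which h fits depends on how the images sit relative to the text in the documents at hand.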

Thanks for the answer, this is what I want for my application. I tried this code, but can you tell me what UNIT_LINE means in the above solution? I thought it was a method, but I can't find it in the iText library. — Karthik, Jan 22 '15 at 13:42
*what UNIT_LINE means* - oops, sorry, forgot to copy it. It's the constant line from (0,0) to (1,0), a constant in `SimpleMixedExtractionStrategy`. I'll edit my answer right away. — mkl, Jan 22 '15 at 16:23
