
I am new to Java programming. I need to extract every table and image from a PDF as it appears in the source. I tried extracting text with PDFBox, but I only get the text and its properties. How can I identify tables, images, lists, etc. in a Java program?

Is it possible to identify these elements in PDF files at all?

The library I am using is PDFBox. Any ideas on how to proceed?

AlexPandiyan
    What we perceive as tables in PDFs is generally merely a collection of text pieces drawn at certain positions on the page, not some table object we can query for rows and columns. Generally, therefore, the best one can do is search for lines or for bars without content, either of which probably divides columns or rows. Such a search is not implemented in PDFBox. It does contain the basic methods required to implement that oneself, though. – mkl Sep 29 '14 at 05:38
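To make the "bars without content" idea from the comment above concrete, here is a minimal sketch in plain Java. It assumes you have already collected the horizontal extent of each text chunk yourself (for example from a `PDFTextStripper` subclass that records text positions). `ColumnGapFinder`, `Span`, and `findColumnGaps` are hypothetical names made up for this illustration and are not part of PDFBox; the coordinates are invented sample data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of PDFBox: finds empty vertical bars that may separate columns. */
final class ColumnGapFinder {

    /** Horizontal extent of one text chunk in page units (hypothetical type). */
    static final class Span {
        final double left;
        final double right;

        Span(double left, double right) {
            this.left = left;
            this.right = right;
        }
    }

    /**
     * Returns the x coordinate at the middle of every horizontal gap that
     * no chunk overlaps and that is at least minGap wide.
     */
    static List<Double> findColumnGaps(List<Span> spans, double minGap) {
        // Sort a copy by left edge so the input list is not modified.
        List<Span> sorted = new ArrayList<>(spans);
        sorted.sort((a, b) -> Double.compare(a.left, b.left));

        List<Double> gaps = new ArrayList<>();
        if (sorted.isEmpty()) {
            return gaps;
        }

        // Sweep left to right, tracking the rightmost x covered so far.
        double coveredUntil = sorted.get(0).right;
        for (Span s : sorted) {
            if (s.left - coveredUntil >= minGap) {
                gaps.add((coveredUntil + s.left) / 2);
            }
            coveredUntil = Math.max(coveredUntil, s.right);
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Three rows of an invented two-column table: (left, right) of each chunk.
        List<Span> spans = Arrays.asList(
                new Span(50, 120), new Span(200, 260),
                new Span(50, 140), new Span(200, 230),
                new Span(50, 100), new Span(200, 280));
        System.out.println(findColumnGaps(spans, 20)); // prints [170.0]
    }
}
```

The same sweep applied to vertical extents would give candidate row separators. Real pages need more care: this sketch treats the whole input as one table, so you would first have to restrict the chunks to the region you suspect is tabular and pick `minGap` relative to the font size.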

1 Answer


The code below can be used to extract images:

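    // Needs: java.util.Iterator, java.util.List, java.util.Map,
    //        org.apache.pdfbox.pdmodel.PDPage, org.apache.pdfbox.pdmodel.PDResources,
    //        org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage
    // Written against the PDFBox 1.x API; "document" is an already loaded PDDocument.
    // getUniqueFileName is not a PDFBox method: it is a helper you have to supply yourself.
    List pages = document.getDocumentCatalog().getAllPages();
    Iterator iter = pages.iterator();
    while( iter.hasNext() )
    {
        PDPage page = (PDPage)iter.next();
        PDResources resources = page.getResources();
        // Only the image resources declared for the page are listed here.
        Map images = resources.getImages();
        if( images != null )
        {
            Iterator imageIter = images.keySet().iterator();
            while( imageIter.hasNext() )
            {
                String key = (String)imageIter.next();
                PDXObjectImage image = (PDXObjectImage)images.get( key );
                String name = getUniqueFileName( key, image.getSuffix() );
                System.out.println( "Writing image:" + name );
                // Writes the image data to a file named after "name".
                image.write2file( name );
            }
        }
    }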

You can refer here for a similar issue.

Imran
    *similarly you can try for other elements like tables, lists* - **this is ridiculous**. Unless the tables or lists actually are images, extracting them is completely unlike extracting images. Furthermore, your code only extracts the image resources of a page, i.e. you do not check whether these images are actually used on the page, and you also ignore inline images. – mkl Sep 29 '14 at 07:39