
I am new to Java programming. My current project requires me to read embedded (OLE) files in an Excel sheet and extract their text contents. The examples for reading an embedded Word file worked fine, but I can't find any help on reading an embedded PDF file. I tried a few things based on similar examples, but they didn't work.

http://poi.apache.org/spreadsheet/quick-guide.html#Embedded

My code is below; with some help I can probably get moving in the right direction. I use Apache POI to read the embedded files from the Excel sheet and PDFBox to parse the PDF data.

public class ReadExcel1 {

public static void main(String[] args) {

    try {

        FileInputStream file = new FileInputStream(new File("C:\\test.xls"));

        POIFSFileSystem fs = new POIFSFileSystem(file);
        HSSFWorkbook workbook = new HSSFWorkbook(fs);

        for (HSSFObjectData obj : workbook.getAllEmbeddedObjects()) {

            String oleName = obj.getOLE2ClassName();

           if(oleName.equals("Acrobat Document")){
                System.out.println("Acrobat reader document");

                try{
                    DirectoryNode dn = (DirectoryNode) obj.getDirectory();
                    for (Iterator<Entry> entries = dn.getEntries(); entries.hasNext();) {

                        DocumentEntry nativeEntry = (DocumentEntry) dn.getEntry("CONTENTS");
                        byte[] data = new byte[nativeEntry.getSize()];

                        ByteArrayInputStream bao= new ByteArrayInputStream(data);
                        PDFParser pdfparser = new PDFParser(bao);

                        pdfparser.parse();
                        COSDocument cosDoc = pdfparser.getDocument();
                        PDFTextStripper pdfStripper = new PDFTextStripper();
                        PDDocument pdDoc = new PDDocument(cosDoc);
                        pdfStripper.setStartPage(1);
                        pdfStripper.setEndPage(2);
                        System.out.println("Text from the pdf "+pdfStripper.getText(pdDoc));
                    }
                }catch(Exception e){
                    System.out.println("Error reading "+ e.getMessage());
                }finally{
                    System.out.println("Finally ");
                }
            }else{
                System.out.println("nothing ");
            }
        }

        file.close();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

}

Below is the output in Eclipse:

Acrobat reader document
Error reading Error: End-of-File, expected line
Finally
nothing

James Shaji
  • The first thing that looks strange is the `dn.getEntry("CONTENTS")` call: the PDF should be in a DirectoryNode named `MBD...` (see [my other answer](http://stackoverflow.com/questions/16910503/embed-files-into-excel-using-apache-poi/17757439#17757439) for more details). I guess you are accessing an empty stream. Can you provide a sample Excel file? – kiwiwings Aug 26 '13 at 17:34
  • Did you try reading the [Apache POI embedded documents documentation](http://poi.apache.org/poifs/embeded.html)? – Gagravarr Aug 26 '13 at 22:18
  • @kiwiwings I do see "MBD" entries in the DirectoryNode, but they don't have any data in them. `dn.getEntry("CONTENTS")` gives me an entry with a size of more than 10000, so I assumed the data is available in that particular entry. – James Shaji Aug 27 '13 at 10:37
  • @James Shaji If you upload a sample file, I can take a look. I'd have to check whether the data can be obtained without further processing from the HSSFObjectData, or whether the POIFS entry has to be used to retrieve it. There can also be a difference between embedded and (OLE 1.0) packaged objects, so it's simply easier to find out with a real file (rather than just theoretical hinting ...) – kiwiwings Aug 27 '13 at 11:19
  • @kiwiwings I have uploaded the Excel sheet to http://jamesshaji.com/sample.xls – James Shaji Aug 27 '13 at 14:23

1 Answer


The PDFs weren't OLE 1.0 packaged but embedded in some other way; at least the extraction worked for me. This is not a general solution, because it depends on how the embedding application names the entries. For PDFs you could of course check all DocumentNodes for the magic number "%PDF"; OLE 1.0 packaged elements need to be handled differently.
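The magic-number check mentioned above could look roughly like this. It is only a sketch using the plain JDK: the class and method names (`PdfMagic`, `startsWithPdfMagic`) are made up for illustration and are not part of POI or PDFBox, and it assumes you have already read the first few bytes of an entry into an array.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of POI or PDFBox.
class PdfMagic {
    // A PDF file normally begins with the ASCII bytes "%PDF".
    private static final byte[] MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given bytes begin with the PDF magic number. */
    static boolean startsWithPdfMagic(byte[] head) {
        if (head == null || head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] zeros = new byte[16]; // an unfilled buffer contains only zero bytes
        System.out.println(startsWithPdfMagic(pdf));   // true
        System.out.println(startsWithPdfMagic(zeros)); // false
    }
}
```

Such a check would let you pick the PDF entry by content rather than by the entry name, which varies between embedding applications.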

I think the real filename of the PDF is hidden somewhere in the \1Ole or CompObj entries, but for the example, and apparently for your use case, it isn't necessary to determine it.

import java.io.*;
import java.net.URL;
import org.apache.poi.hssf.usermodel.*;
import org.apache.poi.poifs.filesystem.*;
import org.apache.poi.util.IOUtils;

public class EmbeddedPdfInExcel {
    public static void main(String[] args) throws Exception {
        // Open the sample workbook as an OLE2 file system
        NPOIFSFileSystem fs = new NPOIFSFileSystem(new URL("http://jamesshaji.com/sample.xls").openStream());
        HSSFWorkbook wb = new HSSFWorkbook(fs.getRoot(), true);
        for (HSSFObjectData obj : wb.getAllEmbeddedObjects()) {
            String oleName = obj.getOLE2ClassName();
            DirectoryNode dn = (DirectoryNode)obj.getDirectory();
            // Only handle Acrobat objects that store the PDF in a "CONTENTS" entry
            if(oleName.contains("Acro") && dn.hasEntry("CONTENTS")){
                // Copy the raw PDF bytes to a file named after the storage directory
                InputStream is = dn.createDocumentInputStream("CONTENTS");
                FileOutputStream fos = new FileOutputStream(obj.getDirectory().getName()+".pdf");
                IOUtils.copy(is, fos);
                fos.close();
                is.close();
            }
        }
        fs.close();
    }
}
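A likely reason the code in the question fails is that it allocates `data` with the entry's size but never reads the stream into it, so the parser only sees zero bytes. If you want the PDF in memory rather than in a file, the stream has to be drained first. Here is a minimal JDK-only sketch of that step; `StreamBytes` and `readFully` are hypothetical names, and in the example above the input would be the stream returned by `dn.createDocumentInputStream("CONTENTS")`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration; not part of POI or PDFBox.
class StreamBytes {
    /** Reads the stream until end-of-file and returns everything that was read. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        // read() returns -1 once the stream is exhausted
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the embedded entry's stream
        InputStream in = new ByteArrayInputStream(new byte[] {'%', 'P', 'D', 'F'});
        byte[] data = readFully(in);
        System.out.println(data.length); // 4
    }
}
```

The resulting array can then be wrapped in a `ByteArrayInputStream` and handed to the PDF parser, as the question's code attempts to do.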
kiwiwings
  • Thanks kiwiwings!! Where can I find documentation to help me understand the file structure? – James Shaji Aug 28 '13 at 05:19
  • Do you really want to read through the MS specs? There are two specs to go through: [the OLE structures](http://msdn.microsoft.com/en-us/library/dd942265.aspx) and [the binary xls](http://msdn.microsoft.com/en-us/library/cc313154.aspx). For the other Office formats, you'll find the specs close to the second link. – kiwiwings Aug 28 '13 at 06:08