1

I am simply trying to read a big excel file(.xlsx) file, using the POI API and getting this exception.

This is the code I am using:

public static void main(String args[]) throws FileNotFoundException {
        File f = new File("E:\\Downloads\\6038_Open_AR.XLSX");
        if (f.exists()){
            System.out.println("Attempting");
            try {
                Workbook workBoo = WorkbookFactory.create(f);
                System.out.println("done");
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

And this is the exception I get.

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.Arrays.copyOfRange(Arrays.java:3664)
    at java.lang.String.<init>(String.java:207)
    at org.apache.xerces.xni.XMLString.toString(Unknown Source)
    at org.apache.xerces.parsers.AbstractDOMParser.characters(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
    at org.apache.poi.util.DocumentHelper.readDocument(DocumentHelper.java:137)
    at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:115)
    at org.openxmlformats.schemas.spreadsheetml.x2006.main.WorksheetDocument$Factory.parse(Unknown Source)
    at org.apache.poi.xssf.usermodel.XSSFSheet.read(XSSFSheet.java:184)
    at org.apache.poi.xssf.usermodel.XSSFSheet.onDocumentRead(XSSFSheet.java:176)
    at org.apache.poi.xssf.usermodel.XSSFWorkbook.parseSheet(XSSFWorkbook.java:428)
    at org.apache.poi.xssf.usermodel.XSSFWorkbook.onDocumentRead(XSSFWorkbook.java:393)
    at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
    at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:260)
    at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:263)
    at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:222)
    at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:201)
    at Main.main(Main.java:180)

I know this is a very popular topic, I could not find any direct solution that works.

Some of the things that I tried:

File f = new File("E:\\Downloads\\6038_Open_AR.XLSX");
File fileInputStream = new FileInputStream(f);
Workbook workBoo = new HSSFWorkbook(fileInputStream);

This gives the following exception

org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents. You need to call a different part of POI to process this data (eg XSSF instead of HSSF)
DockYard
  • 989
  • 2
  • 12
  • 29
  • here is some info about your problem: https://plumbr.io/outofmemoryerror/gc-overhead-limit-exceeded try increasing your heap size , for eclipse here it is shown how to do so: http://www.planetofbits.com/eclipse/increase-jvm-heap-size-in-eclipse/ – JavaMan Aug 14 '19 at 13:26
  • 1
    There are two completely different issues here, one with the memory and the other with the file format. Should be separated to two different posts. – elirandav Aug 14 '19 at 13:27
  • please [refer](https://stackoverflow.com/questions/31844308/java-poi-the-supplied-data-appears-to-be-in-the-office-2007-xml) – thar45 Aug 14 '19 at 13:28
  • @KernelMode i guess once the one issue get resolves the other scope might get reduced – thar45 Aug 14 '19 at 13:29

1 Answers1

2

Without going into specific proprietary information that we used to solve this at my workplace, We used the Apache POI XSSF library to read in one sheet of the .XLSX file at a time. Our source file was 25MB and had 4 separate worksheets.

Apache POI will essentially unzip the XLSX, which expands its size by about 10 times (up to over 200MB in this case). Once unzipped, you have access to the Styles and SharedStrings. We used the XSSFSheetXMLHandler.SheetContentsHandler to then start processing each sheet individually.

We created an AbstractSheetParserService (which reads in the file via XMLReader), sets a ContentHandler using the XSSFSheetXMLHandler.SheetContentsHandler mentioned above, and finally parses the sheet and its values via the AbstractParserHandlerService.

Kit Wolff
  • 46
  • 4
  • Answer this please if you can https://stackoverflow.com/questions/57512192/unable-to-understand-even-model-of-poi-api-for-reading-excel-files-xlsx – DockYard Aug 15 '19 at 15:43
  • My answer for your question would be the same as this one, @DockYard. Apache has several examples of the XSSFSheetXMLHandler.SheetContentsHandler on the POI website. We read in the Styles and SharedStrings portions of the XLSX and handled the sheets one at a time. – Kit Wolff Sep 03 '19 at 17:16