Apache MetaModel - bad performance querying spreadsheet

Question

I need to query a spreadsheet file in Java. I'm using Apache MetaModel.

I imported it with maven using

<dependency>
    <groupId>org.apache.metamodel</groupId>
    <artifactId>MetaModel-excel</artifactId>
    <version>4.5.2</version>
</dependency>

Everything works fine but the next() instruction when it's supposed to return false takes a few seconds, why?

import org.apache.metamodel.DataContext;
import org.apache.metamodel.excel.ExcelDataContext;
import org.apache.metamodel.schema.Schema;
import org.apache.metamodel.schema.Column; 
import org.apache.metamodel.schema.Table; 
import org.apache.metamodel.query.Query;
import org.apache.metamodel.query.OperatorType; 
import org.apache.metamodel.data.DataSet; 
import org.apache.metamodel.data.Row; 
import org.apache.metamodel.MetaModelException;


public class SpreadsheetReader {

    private File spreadsheet;


    public SpreadsheetReader(String spreadsheetLocation, String spreadsheetName){

        this.spreadsheet = new File( spreadsheetLocation + spreadsheetName );

        if( !"OK".equals( checkSpreadSheet() ) ){
            throw new IllegalStateException("Error in spreadsheet. Cause: "+spreadsheetStatus);
        }

    }


    /** query the excel spreadsheet for the given  ID
    */
    public List<String> query( String givenProgId ){

        List<String> linksArray = new ArrayList<String>();
        int rowCount = 0;

        ExcelConfiguration conf = new ExcelConfiguration( 1, true, true ); // columnNameLineNumber, skipEmptyLines, sEColumns
        DataContext dataContext = new ExcelDataContext( this.spreadsheet, conf );

        System.out.println( "PROFILING >>> "+(new org.joda.time.DateTime())+" START-1" ); // ## 

        Schema schema = dataContext.getDefaultSchema();

        System.out.println( "PROFILING >>> "+(new org.joda.time.DateTime())+" STOP-1" ); // ## 
       // Takes 2 seconds. Will be moved into constructor.

        Table table = schema.getTables()[0];

        Column idsColumn = table.getColumnByName("ProgID");
        Column titlesColumn = table.getColumnByName("Titles");

        Query query = new Query().select(titlesColumn)
                                 .from(table)
                                 .where(idsColumn, OperatorType.EQUALS_TO, givenProgId);

        try( DataSet dataSet = dataContext.executeQuery(query) ){ // try-with-resource, no need to close dataset

            while (dataSet.next()) {

                // the rows are read quite quickly, problem will be when next() is false

                ++rowCount;

                Row currentRow = dataSet.getRow();
                String currentTitle = (String)currentRow.getValue(0);

                linksArray.add( "my-service/titles/"+currentTitle );

                System.out.println( "PROFILING >>> "+(new org.joda.time.DateTime())+" START-2" ); // @@@@@@@
            }
            System.out.println( "PROFILING >>> "+(new org.joda.time.DateTime())+" STOP-2" ); // @@@@@@@ 
            // TAKES ABOUT 6 SECONDS - (Excel file has just 14.779 rows and 114 columns)

        }catch(MetaModelException xx){
            //logger
            throw xx;
        }

        return linksArray;
    }
}}

UPDATE: Some more profiling with a spreadsheet document with just 3 entries:

Code now is:

try( DataSet dataSet = this.dataContext.executeQuery(query) ){


    // FIRST NEXT() with result => quite fast

    System.out.println( "\n PROFILING >>> "+(new org.joda.time.DateTime())+" START a\n" );
    System.out.println( "\n 88888 NEXT >>> "+(dataSet.next())+" <<<< \n" ); 
    Row currentRow = dataSet.getRow();
    String currentTitle = (String)currentRow.getValue(0);
    System.out.println( "\n READ: "+(new org.joda.time.DateTime())+" >>> "+currentTitle+" \n" );

    System.out.println( "\n PROFILING >>> "+(new org.joda.time.DateTime())+" STOP a\n" );


    // SECOND AND LAST NEXT() => very SLOW

    System.out.println( "\n PROFILING >>> "+(new org.joda.time.DateTime())+" START b\n" );
    System.out.println( "\n 88888 NEXT >>> "+(dataSet.next())+" <<<< \n" );
    System.out.println( "\n PROFILING >>> "+(new org.joda.time.DateTime())+" STOP b\n" );

}

And the spreadsheet is pre-loaded in the class constructor.

Logs (with times) for the last of a series of subsequent identical queries are:

Jun 30, 2016 10:59:38 AM log my-project.logging.ITVLogger
INFO: CODE00012 - Query on spreadsheet started for ID 123456



PROFILING >>> 2016-06-30T10:59:38.651+01:00 START a

10:59:38.652 [main] INFO  o.a.m.d.RowPublisherDataSet - 
Starting separate thread for publishing action: org.apache.metamodel.excel.XlsxRowPublisherAction@4977e527

88888 NEXT >>> true <<<< 

READ: 2016-06-30T10:59:39.756+01:00 >>> A_TITLE

PROFILING >>> 2016-06-30T10:59:39.756+01:00 STOP a



PROFILING >>> 2016-06-30T10:59:39.756+01:00 START b

88888 NEXT >>> false <<<< 

PROFILING >>> 2016-06-30T10:59:44.735+01:00 STOP b

So to recap it takes about one second to retrieve the result and 4~6 seconds for the last execution of next().

After the first ```next()``` the app logs ```o.a.m.d.RowPublisherDataSet - Starting separate thread for publishing action: org.apache.metamodel.excel.XlsxRowPublisherAction@3d6e18b4```. Could it be waiting for its termination? — Gabe, Jun 29 '16 at 18:34
Switching to github.com/monitorjbl/excel-streaming-reader, reading the whole document on start up and keeping the columns I need in memory in a Map so subsequent queries are faster — Gabe, Jun 30 '16 at 13:09

score 1 · Answer 1 · answered Jun 29 '16 at 19:35

1

The excel DataContext implementation does it's unwrapping and parsing of the zipped .xlsx file in the background. This means that the DataContext.executeQuery(...) method returns quickly but the DataSet.next() call that comes right after it has to wait for the data to be available in memory. I don't see any way that you can avoid that, it's just a consequence of Excel files being rather complex things to deal with.

answered Jun 29 '16 at 19:35

Kasper Sørensen

321
2
6

Yes but the point is that let's say we just have one row as result, the first call to ```dataset.next()``` and the following ```dataset.getRow().getValue(0)``` print out the value got from the spreadsheet quite quickly (in less than one second) but when it calls ```dataset.next()``` the second time (this will returns false) that call takes 4~8 seconds to complete. That's the bottleneck and I'm wondering why it happens. I've also tried to initialize the DataSet in the constructor but any subsequent query has quite the same delay. Thanks - Nice library by the way, it's very readable :) – Gabe Jun 30 '16 at 00:34
```DataContext``` now initialized in the constructor, spreadsheet now has just 3 entries => the last next() executed still takes 6 seconds **for each query** – Gabe Jun 30 '16 at 11:08
1

That's a very interesting observation. I would suggest raising it as a potential bug in the MetaModel developer mailing list. Would need to dive deeper into it to find out if it is in fact a bug or if it is something else. – Kasper Sørensen Jul 01 '16 at 04:30

Apache MetaModel - bad performance querying spreadsheet

1 Answers1

Linked