I need to query a spreadsheet file in Java. I'm using Apache MetaModel.
I imported it with maven using
<dependency>
<groupId>org.apache.metamodel</groupId>
<artifactId>MetaModel-excel</artifactId>
<version>4.5.2</version>
</dependency>
Everything works fine but the next()
instruction when it's supposed to return false takes a few seconds, why?
import org.apache.metamodel.DataContext;
import org.apache.metamodel.excel.ExcelDataContext;
import org.apache.metamodel.schema.Schema;
import org.apache.metamodel.schema.Column;
import org.apache.metamodel.schema.Table;
import org.apache.metamodel.query.Query;
import org.apache.metamodel.query.OperatorType;
import org.apache.metamodel.data.DataSet;
import org.apache.metamodel.data.Row;
import org.apache.metamodel.MetaModelException;
public class SpreadsheetReader {
private File spreadsheet;
public SpreadsheetReader(String spreadsheetLocation, String spreadsheetName){
this.spreadsheet = new File( spreadsheetLocation + spreadsheetName );
if( !"OK".equals( checkSpreadSheet() ) ){
throw new IllegalStateException("Error in spreadsheet. Cause: "+spreadsheetStatus);
}
}
/** query the excel spreadsheet for the given ID
*/
public List<String> query( String givenProgId ){
List<String> linksArray = new ArrayList<String>();
int rowCount = 0;
ExcelConfiguration conf = new ExcelConfiguration( 1, true, true ); // columnNameLineNumber, skipEmptyLines, sEColumns
DataContext dataContext = new ExcelDataContext( this.spreadsheet, conf );
System.out.println( "PROFILING >>> "+(new org.joda.time.DateTime())+" START-1" ); // ##
Schema schema = dataContext.getDefaultSchema();
System.out.println( "PROFILING >>> "+(new org.joda.time.DateTime())+" STOP-1" ); // ##
// Takes 2 seconds. Will be moved into constructor.
Table table = schema.getTables()[0];
Column idsColumn = table.getColumnByName("ProgID");
Column titlesColumn = table.getColumnByName("Titles");
Query query = new Query().select(titlesColumn)
.from(table)
.where(idsColumn, OperatorType.EQUALS_TO, givenProgId);
try( DataSet dataSet = dataContext.executeQuery(query) ){ // try-with-resource, no need to close dataset
while (dataSet.next()) {
// the rows are read quite quickly, problem will be when next() is false
++rowCount;
Row currentRow = dataSet.getRow();
String currentTitle = (String)currentRow.getValue(0);
linksArray.add( "my-service/titles/"+currentTitle );
System.out.println( "PROFILING >>> "+(new org.joda.time.DateTime())+" START-2" ); // @@@@@@@
}
System.out.println( "PROFILING >>> "+(new org.joda.time.DateTime())+" STOP-2" ); // @@@@@@@
// TAKES ABOUT 6 SECONDS - (Excel file has just 14.779 rows and 114 columns)
}catch(MetaModelException xx){
//logger
throw xx;
}
return linksArray;
}
}}
UPDATE: Some more profiling with a spreadsheet document with just 3 entries:
Code now is:
try( DataSet dataSet = this.dataContext.executeQuery(query) ){
// FIRST NEXT() with result => quite fast
System.out.println( "\n PROFILING >>> "+(new org.joda.time.DateTime())+" START a\n" );
System.out.println( "\n 88888 NEXT >>> "+(dataSet.next())+" <<<< \n" );
Row currentRow = dataSet.getRow();
String currentTitle = (String)currentRow.getValue(0);
System.out.println( "\n READ: "+(new org.joda.time.DateTime())+" >>> "+currentTitle+" \n" );
System.out.println( "\n PROFILING >>> "+(new org.joda.time.DateTime())+" STOP a\n" );
// SECOND AND LAST NEXT() => very SLOW
System.out.println( "\n PROFILING >>> "+(new org.joda.time.DateTime())+" START b\n" );
System.out.println( "\n 88888 NEXT >>> "+(dataSet.next())+" <<<< \n" );
System.out.println( "\n PROFILING >>> "+(new org.joda.time.DateTime())+" STOP b\n" );
}
And the spreadsheet is pre-loaded in the class constructor.
Logs (with times) for the last of a series of subsequent identical queries are:
Jun 30, 2016 10:59:38 AM log my-project.logging.ITVLogger
INFO: CODE00012 - Query on spreadsheet started for ID 123456
PROFILING >>> 2016-06-30T10:59:38.651+01:00 START a
10:59:38.652 [main] INFO o.a.m.d.RowPublisherDataSet -
Starting separate thread for publishing action: org.apache.metamodel.excel.XlsxRowPublisherAction@4977e527
88888 NEXT >>> true <<<<
READ: 2016-06-30T10:59:39.756+01:00 >>> A_TITLE
PROFILING >>> 2016-06-30T10:59:39.756+01:00 STOP a
PROFILING >>> 2016-06-30T10:59:39.756+01:00 START b
88888 NEXT >>> false <<<<
PROFILING >>> 2016-06-30T10:59:44.735+01:00 STOP b
So to recap it takes about one second to retrieve the result and 4~6 seconds for the last execution of next()
.