Efficient Algorithm for reading data from Excel Spreadsheet in Java

Question

I'm working on a little project that reads party (aka client) data from a spreadsheet into two hashmaps. One keeps track of each party with the value being the object Party, the other is embedded in the Party Object that keeps track of each party's data. Thing is, the way I'm doing it is with two for-loops, which as we all know is a O(N^2) algorithm. The way it is now is about 500 rows (or 500 parties) with about 65 columns (or 65 labels/values) so at that number of elements it's not really a big deal. However, I've been told it might have to handle over 25 million rows, in which case O(N^2) is a problem (not technically O(N^2) with the columns I guess, but the number of columns could be expanded it's not necessarily set at 65).

Long story short, I need tips on how to reduce the run-time but I can't really think of any other way to access every cell in the sheet.

Here's the relevant code:

package storage;

import java.io.File;
import java.util.HashMap;


import jxl.Sheet;
import jxl.Workbook;

import pojo.Party;

public class PartyStructure {

    private static HashMap<String, Party> map;
    private static PartyStructure partyStructure;
    private String inputFile = "C:/Users/joayers/Documents/API Project Information/Sample Data.xls";
    File excelData = new File(inputFile);

    private PartyStructure() throws Exception
    {
        map = new HashMap<String, Party>();
        readData();
    }

    public static HashMap<String,Party> getPartyCollection() throws Exception
    {
        if(partyStructure==null)
        {
            partyStructure = new PartyStructure();
        }
        return map;
    }
    private void readData() throws Exception 
    {
        Workbook w=Workbook.getWorkbook(excelData);
        Sheet sheet = w.getSheet(0);
        String party_name;
        String labelName;
        String dataField;

        for(int i=1;i<sheet.getRows();i++)
        {
            party_name = sheet.getCell(2, i).getContents().toString();
            //map is a Hashmap<String, Party> 
            map.put(party_name, new Party());

                for(int j=0;j<sheet.getColumns();j++)
                {
                    labelName = sheet.getCell(j, 0).getContents().toString();
                    dataField = sheet.getCell(j, i).getContents().toString();
                    Party party = map.get(party_name);
                    //getPartyInfo is a getter for a HashMap<String, String> that holds values associated with the keys (the labels in excel)
                    party.getPartyInfo().put(labelName, dataField);
                }
        }   
    }

}

Also, is there any difference between a hashmap and a hashtable? They seem like the same thing

I'm not sure it is correct to say the algorithm is O(N^2). Here N is the number of cells, so this is O(N). — Raedwald, Jul 05 '13 at 22:24
Row limit in Excel is in the million range (http://answers.microsoft.com/en-us/office/forum/office_2010-excel/how-many-rows-in-excel-2010-64bit-version-runnin/25076632-ba6f-4454-a386-fc3c92d71ee6) plus I think Raedwald is right about this one so there's no problem here. — Scott Shipp, Jul 05 '13 at 23:16
Embarrassing now that I think about it, you're totally right — , Jul 06 '13 at 04:28

score 0 · Answer 1 · answered Jul 05 '13 at 20:43

0

The first thing I would suggest is to take the declarations outside of (before) the loops:

String party_name = sheet.getCell(2, i).getContents().toString();    // etc.

and labelName, dataField, party. Declare before the loops:

String party_name = "";    // etc.

You haven't said what library you are using for Excel. Some libraries have, for example getUsedRange to narrow down the cells you are searching, and (perhaps) methods to fill an array from a Range.

answered Jul 05 '13 at 20:43

Andy G

19,232
5
47
69

What is the benefit of declaring fields before loops? Does it conserve memory? Also, updated to include entire class, I'm using the jxl library. – Jul 05 '13 at 20:48
1

I do not know whether the compiler will optimize your current code, or what performance improvement it may yield. But, simply, there is no need to keep re-declaring the variables within the loops. – Andy G Jul 05 '13 at 20:52
Okay thanks I like to maintain good style when I can, there is no specified range other than everything in the table. Is there anyway to reduce the runtime by a factor of N or maybe a large constant? – Jul 05 '13 at 20:56
I've only had a quick look but that jxl library looks a little old and , perhaps, limited. Someone else may recommend a better library. If you can grab the used-range and maybe put it into an array then I would expect performance to improve. – Andy G Jul 05 '13 at 20:58
If the Excel file contains just a table then it should be possible to treat it as a database and use a database connection to read the data into a **RecordSet**. I haven't done this with Java though. – Andy G Jul 05 '13 at 21:02

score 0 · Answer 2 · edited May 23 '17 at 12:05

0

If you have to read all cells, and gather the contents in a hash map, you could consider parallelizing this task: You could partition your task by rows: Have some threads work on different regions in your sheet.

In your case you could possibly collect the thread results in separate maps, and in the end you could put this together (so there would be no need to synchronize on the hash map).

A HashMap is not synchronized, a Hashtable is synchronized (details here).

edited May 23 '17 at 12:05

Community

1
1

answered Jul 05 '13 at 20:55

Beryllium

12,808
10
56
86

Oof, I think I had one assignment in school so far where I had to create multiple threads and it was more of a, "Just insert this code don't worry about what it does" than an assignment on how to create multi-threaded programs. Definitely something I need to look into. – Jul 05 '13 at 21:09

Efficient Algorithm for reading data from Excel Spreadsheet in Java

2 Answers2