
Context: I am reading an Excel file in a certain format using the Apache POI library. Each file has only one sheet and follows a certain template. I am able to read the sheet, perform some manipulation on the values, store them as POJOs, and then convert them to XML using a JAXB implementation.

Problem: I am reading only a few Excel files (say 100) for now, but I want to design my application so that it scales to around 1,000 to 10,000 files. Can you suggest a good architecture for this? Also, should I be using multithreading (say a thread pool of 10 threads) to read 10 sheets at once, or would that be a bad design, considering that each sheet holds separate data that is not interlinked with any other sheet?

Note: I cannot share any code snippets since the code is proprietary, but for the sake of discussion, assume each sheet has 50 rows and each row has 6 to 10 columns, with plain text in all the cells. Since each file is small, I am loading the entire file into memory and then processing it. Also, I am using Apache POI to iterate through the rows and columns (sample below):

// assuming the usual POI imports (org.apache.poi.xssf.usermodel.*) and a FileInputStream for the file
XSSFWorkbook workbook = new XSSFWorkbook(fileInputStream);
XSSFSheet sheet = workbook.getSheetAt(0);
// outer loop iterates over all rows
for (int i = 0; i <= sheet.getLastRowNum(); i++) {
    XSSFRow row = sheet.getRow(i);
    // inner loop iterates over all columns in the row
    for (int j = 0; j < row.getLastCellNum(); j++) {
        XSSFCell value = row.getCell(j);
        // use 'value' as and when required
    }
}

P.S. This is my first question on SO, so please feel free to suggest any changes/improvements in my question.

Thanks and Regards, Sid

phoenixSid
  • Well, we would have to know how exactly you read the data. You could change the code to simulate how you read the data (shouldn't be a problem with the proprietary thing). – XtremeBaumer Apr 21 '17 at 07:28
  • Since they are not interlinked it is very well suited for parallel processing. – Klas Lindbäck Apr 21 '17 at 09:01
  • @XtremeBaumer Have added a snippet. Please see if it helps. – phoenixSid Apr 25 '17 at 14:31
  • From looking at the code snippet I would say you made some mistakes there. Sadly, I don't really understand what you mean by this: `the fact that each sheet has separate data that is not interlinked with any other sheet`. I think the best you can do is to switch to parallel streams to implement this; it could be faster. Otherwise try it with multithreading. – XtremeBaumer Apr 25 '17 at 14:48
  • @XtremeBaumer Could you please elaborate on what you mean by "switch to parallel streams"? – phoenixSid May 31 '17 at 10:42

1 Answer


If you want to process many things in parallel, you might want to change to SAX parsing in POI; we increased performance by orders of magnitude doing that (we had quite large files to begin with, though).
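
For reference, a minimal sketch of that streaming approach, assuming POI's XSSFReader / XSSFSheetXMLHandler event API (the class name and the simplified error handling are placeholders, not your actual code):

import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler;
import org.apache.poi.xssf.usermodel.XSSFComment;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class SaxSheetReader {

    public static void readFirstSheet(String path) throws Exception {
        try (OPCPackage pkg = OPCPackage.open(path, PackageAccess.READ)) {
            XSSFReader xssfReader = new XSSFReader(pkg);
            ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(pkg);

            // Callback invoked for each row/cell while the sheet XML is streamed
            SheetContentsHandler handler = new SheetContentsHandler() {
                public void startRow(int rowNum) { }
                public void endRow(int rowNum) { }
                public void cell(String cellReference, String formattedValue, XSSFComment comment) {
                    // use 'formattedValue' to build your POJO here
                }
                public void headerFooter(String text, boolean isHeader, String tagName) { }
            };

            // Each workbook has a single sheet, so take the first sheet stream only
            try (InputStream sheet = xssfReader.getSheetsData().next()) {
                SAXParserFactory factory = SAXParserFactory.newInstance();
                factory.setNamespaceAware(true); // required for the sheet XML namespaces
                XMLReader parser = factory.newSAXParser().getXMLReader();
                parser.setContentHandler(new XSSFSheetXMLHandler(
                        xssfReader.getStylesTable(), strings, handler, new DataFormatter(), false));
                parser.parse(new InputSource(sheet));
            }
        }
    }
}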

You say you are loading files into memory; to improve performance you should also consider using SSDs rather than HDDs if there is a lot of I/O (that is, if you can't fit it all in RAM).

Also, the thread-pool size should take this into consideration: the number of processor cores vs. the size of the thread pool.
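
For example, a fixed pool sized from the core count rather than a hard-coded 10; processOneFile is a hypothetical stand-in for your existing per-file read/manipulate/JAXB logic:

import java.io.File;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ExcelBatchProcessor {

    // Placeholder for your existing read -> manipulate -> POJO -> JAXB logic
    static void processOneFile(File excelFile) {
        // ...
    }

    public static void processAll(List<File> excelFiles) throws InterruptedException {
        // Size the pool from the machine instead of hard-coding it;
        // for mostly CPU-bound parsing, the core count is a sensible starting point.
        int poolSize = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);

        for (File file : excelFiles) {
            pool.submit(() -> processOneFile(file)); // one independent task per workbook
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}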

You could consider making it more of a pipeline approach (depending on how your code is structured), so you do something like:

excelFiles.parallelStream().map(read).map(manipulate).map(store).map(convert)

If you squeeze filters in there somewhere, or if some stage returns empty, the lazy approach lets you reduce the load more easily.
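
A rough sketch of such a pipeline, where readSheet, manipulate and toXml are hypothetical stand-ins for your own steps (the terminal collect is what actually drives the lazy stream):

import java.io.File;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

public class ExcelPipeline {

    static class SheetData { } // placeholder for your POJO

    // Hypothetical stages; each would wrap part of your existing logic
    static Optional<SheetData> readSheet(File f) { return Optional.of(new SheetData()); }
    static SheetData manipulate(SheetData d) { return d; }
    static String toXml(SheetData d) { return "<sheet/>"; }

    public static List<String> run(List<File> excelFiles) {
        return excelFiles.parallelStream()
                .map(ExcelPipeline::readSheet)
                .filter(Optional::isPresent)   // drop files that produced nothing
                .map(Optional::get)
                .map(ExcelPipeline::manipulate)
                .map(ExcelPipeline::toXml)
                .collect(Collectors.toList()); // terminal operation triggers the pipeline
    }
}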

Just a few ideas, YMMV.

Viktor Mellgren
  • Thanks for your ideas! However, I am looking for a software architecture, as in class diagrams and such, that would be good in this scenario. Also, JAXB makes it much easier than manually parsing everything via SAX. – phoenixSid May 13 '17 at 05:47