
I am using Apache POI to read/write an Excel file for my company as an intern here. My program goes through the Excel file, which is a big grid with computer names across the top row and user names down the left column: 240 computers and 342 users. sheet[computer][user] starts at 0 in all cells, and the program calls PSLoggedon for each computer, takes the username(s) currently logged on, and increments their 0, so after running it for a month it shows who is logged in the most to each computer. So far it runs in about 25 minutes, since I use a socket to check socket.connect before actually calling PSLoggedon.

Without reading or writing the Excel file at all, just making the PSLoggedon calls to each computer takes about 9 minutes, so the reading and writing apparently take 10-15 minutes. The thing is, I am calling PSLoggedon on a computer, then opening the Excel file to find the [x][y] spot for [computer][user], writing += 1 to it, and then closing it. So I suppose the reason it takes this long is that it opens and closes the file so much? I could be completely wrong. But I can't think of a way to make this faster by reading/writing all at once and only opening and closing the file once. Any ideas?

EthanSchatz

3 Answers


Apache POI is normally very fast; if you are running into performance issues, check the points below:

  1. POI's logging might be on; you need to turn it off. You can add this -D flag to your JVM settings to do so (see the example after this list):

    -Dorg.apache.poi.util.POILogger=org.apache.poi.util.NullLogger

  2. You may be setting your JVM heap to a low value; try increasing it.

  3. Prefer XLS over XLSX.
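On the command line, the flag goes before the jar or class name, e.g. `java -Dorg.apache.poi.util.POILogger=org.apache.poi.util.NullLogger -jar yourapp.jar` (the jar name is just a placeholder). If you cannot change how the JVM is launched, here is a minimal sketch of setting the same property programmatically; it assumes the call runs before the first POI class is loaded, since POI reads the property only once, when it creates its first internal logger:

    public class Main {
        public static void main(String[] args) {
            // Must execute before any POI class is touched; POI reads this
            // system property once, when its first internal logger is created.
            System.setProperty("org.apache.poi.util.POILogger",
                               "org.apache.poi.util.NullLogger");
            // ... rest of the program: PSLoggedon calls, POI reads/writes
        }
    }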

Garry
  • Splitting the application into two parts, where one part (perhaps even concurrently) calls PSLoggedon and the other part updates the Excel file, seems a better solution. Also, this looks like a good example where an in-process database like H2 could speed things up, if you use it as an intermediate store and write out the Excel file sequentially once all reading has been completed. From my experience, jumping around in a file to write has worse performance characteristics than sequential writing. – llogiq Jul 07 '15 at 06:44
  • @Garry Thanks for the feedback. About XLS/XLSX: I was using XLS at first and switched to XLSX because XLS does not give me enough columns for the number of computers we have. – EthanSchatz Jul 07 '15 at 14:55
  • @llogiq That would be faster; that is what I am trying to figure out how to do. I can't write sequentially because I wouldn't know what to write to each cell. I get the logged-in users for a computer and write to their cells. Unless what you mean is: get the logged-in users for a computer and loop through each cell in that computer's column. – EthanSchatz Jul 07 '15 at 14:58
  • And @Garry I want to check whether the logger is enabled, and I found the page you got that info from, but how do I add it to my JVM settings? I've never done something like that. – EthanSchatz Jul 07 '15 at 16:15
  • Telling someone to use XLS over XLSX is the worst advice you can give. – Jonathan Drapeau Jul 09 '15 at 13:51
  • If you don't have anything specific to XLSX and are facing performance issues, it's worth switching, and it is proper advice. – Garry Jul 09 '15 at 14:59
  • @EthanSchatz... I hope you are able to increase the JVM heap size. If not, please see this: http://stackoverflow.com/a/6452812/1129313 – Garry Jul 10 '15 at 11:28
  1. Get HSQLDB (or another in-process database, but this is what I've used in the past). Add it to your build.
  2. You can now create either a file-based or in-memory database (I would use file-based, as it lets you persist state between runs) simply by using JDBC. Create a table with the columns User, Computer, Count.
  3. In your reading thread(s), INSERT or UPDATE your table whenever you find a user with PSLoggedon.
  4. Once your data collection is complete, SELECT Computer, User, Count FROM Data ORDER BY Computer, User (or switch the order, depending on your Excel file layout), loop through the ResultSet, and write the results directly; see the sketch after this list.
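A minimal sketch of steps 2-4, assuming the HSQLDB jar is on the classpath. USER and COUNT are reserved words in SQL, so the columns are named UserName and LoginCount here; the database name usagedb and the sample values are illustrative:

    import java.sql.*;

    public class UsageStore {
        public static void main(String[] args) throws SQLException {
            // Step 2: file-based database; HSQLDB creates usagedb.* files on
            // first run (IF NOT EXISTS requires HSQLDB 2.3+)
            Connection c = DriverManager.getConnection("jdbc:hsqldb:file:usagedb", "SA", "");
            try (Statement st = c.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS Data ("
                         + "Computer VARCHAR(64) NOT NULL, "
                         + "UserName VARCHAR(64) NOT NULL, "
                         + "LoginCount INT DEFAULT 0 NOT NULL, "
                         + "PRIMARY KEY (Computer, UserName))");
            }

            // Step 3: call once per user that PSLoggedon reports (sample values)
            recordLogin(c, "PC-042", "jdoe");

            // Step 4: read everything back in a stable order, then write the
            // workbook in one pass instead of reopening it per cell
            try (Statement st = c.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT Computer, UserName, LoginCount FROM Data "
                   + "ORDER BY Computer, UserName")) {
                while (rs.next()) {
                    // write rs.getString(1), rs.getString(2), rs.getInt(3)
                    // into the corresponding cell here
                }
            }
            try (Statement st = c.createStatement()) {
                st.execute("SHUTDOWN"); // flush the file-based database cleanly
            }
        }

        // UPDATE first; if no row was touched, the pair is new, so INSERT it
        static void recordLogin(Connection c, String computer, String user) throws SQLException {
            try (PreparedStatement upd = c.prepareStatement(
                    "UPDATE Data SET LoginCount = LoginCount + 1 "
                  + "WHERE Computer = ? AND UserName = ?")) {
                upd.setString(1, computer);
                upd.setString(2, user);
                if (upd.executeUpdate() == 0) {
                    try (PreparedStatement ins = c.prepareStatement(
                            "INSERT INTO Data (Computer, UserName, LoginCount) VALUES (?, ?, 1)")) {
                        ins.setString(1, computer);
                        ins.setString(2, user);
                        ins.executeUpdate();
                    }
                }
            }
        }
    }

If multiple reading threads share one Connection, guard recordLogin with synchronization or give each thread its own Connection; HSQLDB supports concurrent connections to the same file database within one process.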
llogiq
  • Thanks for the feedback; this sounds like a good idea to try. It will be my first time ever trying something like this, so I am going to read up on how to use HSQLDB and see if I can figure this out. :) I made the file-based JDBC connection with `Connection c = DriverManager.getConnection("jdbc:hsqldb:file:testdb","SA","");` – EthanSchatz Jul 08 '15 at 15:23

This is an old question, but from what I see:

  1. Since you are sampling and using Excel, is it safe to assume that consistency and atomicity aren't critical? You're just estimating fractional usage and don't care if a user logged in and logged out between observations.
  2. Is the Excel file stored over a slow network link? Opening and closing a file 240 times could add significant overhead. How about the following:
  3. You need to open the Excel file once to get the list of computers. At that time, just snapshot the entire contents of the matrix into a Map<ComputerName, Map<UserName, Count>>. Also get a List<ComputerName> and List<UserName> to remember the row/column headings. The entire spreadsheet holds fewer than 90,000 integers, so there is no need to bring in heavy database machinery.
  4. 9 minutes for 240 computers, single-threaded, is roughly 2.25 seconds per computer. Is that the expected throughput of PSLoggedOn? Can you create a thread pool and query all 240 computers at once, or in a small number of rounds?
  5. Then parse the results, increment your map, and dump it back to the Excel file once. Is there a possibility that you might see new users that were not previously in the Excel file? Those will need to be added to the Map and the List<UserName>. A rough sketch of points 3-5 follows.
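Here is a rough sketch of points 3-5 (untested). It assumes the workbook is named logins.xlsx with computer names across the first row, and a hypothetical psLoggedOn helper that runs PSLoggedon and parses user names out of its output; the pool size of 32 is an arbitrary starting point to tune:

    import java.io.*;
    import java.util.*;
    import java.util.concurrent.*;
    import org.apache.poi.ss.usermodel.*;

    public class UsageSampler {
        // Hypothetical wrapper: run PSLoggedon against one computer and
        // parse the logged-on user names out of its output.
        static List<String> psLoggedOn(String computer) {
            return Collections.emptyList(); // placeholder
        }

        public static void main(String[] args) throws Exception {
            // Point 3: open the workbook ONCE and snapshot the headings.
            Workbook wb;
            try (InputStream in = new FileInputStream("logins.xlsx")) {
                wb = WorkbookFactory.create(in);
            }
            Sheet sheet = wb.getSheetAt(0);
            List<String> computers = new ArrayList<>();
            Row header = sheet.getRow(0);
            for (int col = 1; col < header.getLastCellNum(); col++) {
                computers.add(header.getCell(col).getStringCellValue());
            }
            Map<String, Map<String, Integer>> counts = new ConcurrentHashMap<>();

            // Point 4: probe all computers concurrently instead of one by one.
            ExecutorService pool = Executors.newFixedThreadPool(32);
            List<Future<?>> probes = new ArrayList<>();
            for (String computer : computers) {
                probes.add(pool.submit(() -> {
                    for (String user : psLoggedOn(computer)) {
                        counts.computeIfAbsent(computer, k -> new ConcurrentHashMap<>())
                              .merge(user, 1, Integer::sum);
                    }
                }));
            }
            for (Future<?> probe : probes) probe.get(); // wait for every probe
            pool.shutdown();

            // Point 5: merge counts into the sheet (locating each
            // [computer][user] cell and appending rows for new users is
            // plain bookkeeping, omitted here), then write the file ONCE.
            try (OutputStream out = new FileOutputStream("logins.xlsx")) {
                wb.write(out);
            }
            wb.close();
        }
    }

Since the workbook was read fully into memory from the input stream, writing back to the same path after the input is closed is safe here.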