
Recently, I implemented parallelisation in my MATLAB program, following the suggestions offered in Slow xlsread in MATLAB. However, the parallelism has thrown up another problem: processing time that increases non-linearly with scale.

The culprit seems to be the java.util.concurrent.LinkedBlockingQueue method, as can be seen from the attached profiler images and the corresponding condensed graphs.

Problem: How do I remove this non-linearity? My work involves processing more than 1000 sheets in a single run, which would otherwise take an insanely long time.

Note: The parallelised part of the program involves just reading all the .xls files and storing them in matrices, after which the remainder of the program starts. dlmwrite is used towards the end of the program; optimizing its runtime is not really required, although suggestions there are also welcome.


Profiler details for reading a single Excel sheet from a file having n sheets.

Graph for the times in the above tables.


Processing multiple sheets from a file containing multiple sheets.


Culprit: [profiler screenshot of the q.poll call]

Code being parallelised:

parfor i = 1:runs
    % Build this iteration's sheet name, e.g. 'Sheet1', 'Sheet2', ...
    sna = sprintf('Sheet%d', i);

    % Read one sheet in 'basic' mode (avoids the Excel ActiveX server)
    data(i, :, :) = xlsread('Processes.xls', sna, '', 'basic');
end
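
A minimal variant of the same loop (assuming every sheet shares the same, known dimensions) that preallocates data, so it stays a properly sliced parfor output instead of growing implicitly; nRows and nCols are hypothetical:

nRows = 100;                         % hypothetical - every sheet must share these dimensions
nCols = 10;
data  = zeros(runs, nRows, nCols);   % preallocate the sliced output array

parfor i = 1:runs
    data(i, :, :) = xlsread('Processes.xls', sprintf('Sheet%d', i), '', 'basic');
end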
OrangeRind
  • `LinkedBlockingQueue` is a class, not a method. Is it the `take` method that is using up all the time? How do you create the queue? – Gabe Jun 01 '11 at 05:31
  • I added some more details above. The culprit is the `q.poll` statement. My gut feeling says that this could be happening due to blockages in accessing the Excel file because of concurrency - but I could be totally off-target here. – OrangeRind Jun 01 '11 at 05:39
  • Yes, it's quite likely that the Excel file reader is itself single-threaded, such that you can only read a file from one thread at a time. However, that doesn't explain why it's the `poll` function that is taking all the time. – Gabe Jun 01 '11 at 05:42
  • However - if xlsread were single-threaded, then at least it would show constant performance when reading a single sheet from a multi-sheet file. Another theory says that the program gets scared when it looks at a file having a large number of sheets. :D – OrangeRind Jun 01 '11 at 05:58
  • I looked closer at your tables and it appears as though contention isn't the problem -- your program is spending most of its time waiting for entries to be enqueued, not contending for locks to access the queue. Maybe you have some nonlinear processing algorithm. – Gabe Jun 01 '11 at 06:17
  • The parallel processing is only involved in reading the .xls files, as in the code I pasted above - I have moved all non-linear processing out of the parallel loop into a following sequential loop, which gets invoked only after this parallel file reading is done. The Java function is invoked only in the parallel part. – OrangeRind Jun 01 '11 at 06:21
  • Try to avoid using Java's LinkedBlockingQueue, since calling Java methods is enormously slow; see http://stackoverflow.com/questions/1693429/matlab-oop-is-it-slow-or-am-i-doing-something-wrong – Jirka cigler Jun 01 '11 at 06:47
  • You've shown that your problem is that you're sitting around waiting for entries to be enqueued, yet you've posted no code that shows how the queue is actually being used. How does data get in the queue? Where does that data come from? – Gabe Jun 01 '11 at 07:07
  • @Gabe I have no idea about that. The Java methods are called by MATLAB's parallel for loop automatically. I do not choose to call those specific functions, nor do I deliberately choose Java - all I do is put that xlsread call inside a parfor loop (as shown in the code above) and it does the rest. So I have no idea of any queues building anywhere, nor do I think of calling any other function. :) – OrangeRind Jun 01 '11 at 07:10
  • Looking at MATLAB's Parallel Computing Toolbox, I find that it uses the JVM to achieve parallelism. No idea beyond that. – OrangeRind Jun 01 '11 at 07:13
  • @Jirka It seems that I have no control over which methods to use, as I mentioned in the comment above in reply to Gabe. I just call xlsread from within parfor, and the parallelizing is taken care of entirely by MATLAB and the JVM. – OrangeRind Jun 01 '11 at 07:15
  • I see, you're saying that MATLAB invokes the Java code as part of `xlsread`? In that case there's probably little you can do about it. If you have Excel on your machine, though, I'm curious what removing the `basic` parameter would do to your timings. – Gabe Jun 01 '11 at 07:16
  • That's right. Removing 'basic' increases the time by more than an order of magnitude - see the linked question http://stackoverflow.com/questions/6173198/slow-xlsread-in-matlab Apparently, Excel's ActiveX server takes a long time when responding to MATLAB calls - thus I need to use 'basic' mode. I also had to give up xlswrite for dlmwrite in another part of my program to gain another order-of-magnitude reduction in the times. Ah. Sigh. :) – OrangeRind Jun 01 '11 at 07:21
  • Stepping back a bit: any way you can change your source to not spit out .xls files? It might help to outline the bigger picture: why do you need this format? What kind of processing are you doing? Do you need all the data in all the .xls files before starting your computation? – Ashish Uthama Jun 01 '11 at 13:43
  • @Ashish The most I can do is output to a delimited text file. But then I would need to append each new run to the text file and separate the runs later in the program; in .xls, I just switch to a new sheet. The program I use to output the data file is closed-source proprietary software. I do need all the data beforehand, as data generation is slower than data processing and each simulation run is independent. – OrangeRind Jun 02 '11 at 04:30
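
A minimal sketch of the append-then-split approach discussed in the last two comments, assuming the generating program can prefix every output row with its run index (the file name allruns.txt and the column layout are hypothetical):

M = dlmread('allruns.txt');               % one bulk read instead of many xlsread calls
runIds  = M(:, 1);                        % assumed: first column holds the run index
runData = cell(max(runIds), 1);
for k = 1:max(runIds)
    runData{k} = M(runIds == k, 2:end);   % rows belonging to run k
end

Each cell of runData then plays the role of one sheet in the current .xls output.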

1 Answer


Doing parallel I/O is likely to be a problem (it could in fact be slower) unless perhaps you keep everything on an SSD. If you are always reading the same file and it is not enormous, you may want to try reading it before your loop and doing just your data manipulation in parallel.
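
A minimal sketch of that suggestion: do the file reads serially up front, then parallelise only the in-memory work (process() is a hypothetical stand-in for the actual per-sheet analysis):

raw = cell(runs, 1);
for i = 1:runs                        % serial reads: only one reader touches the file
    raw{i} = xlsread('Processes.xls', sprintf('Sheet%d', i), '', 'basic');
end

results = cell(runs, 1);
parfor i = 1:runs                     % parallel part: pure computation, no file I/O
    results{i} = process(raw{i});     % hypothetical analysis function
end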

zeFrenchy