Why is indexing into a dataset array so slow? A peek into the dataset.subsref function shows that all the columns of the dataset are stored in a cell array. However, cell indexing is much, much faster than dataset indexing, which is just indexing into a cell array under the hood. My guess is that this has to do with some overhead in MATLAB OOP. Any ideas on how to speed this up?

%% Using R2011a, PCWIN64
feature accel off;  % turn off JIT

dat = (1:1e6)';
dat2 = repmat({'abc'}, 1e6, 1);
celldat = {dat dat2};
ds = dataset(dat, dat2);
N = 1e2;

tic;
for j = 1:N
    tmp = celldat{2};
end
toc;

tic;
for j = 1:N
    tmp2 = ds.dat2; % 2.778sec spent on line 262 of dataset.subsref
end
toc;

feature accel on;  % turn JIT back on
Elapsed time is 0.000165 seconds.
Elapsed time is 2.778995 seconds.

EDIT: I've updated the example to be more like the problem I'm seeing. A huge amount of time is spent on line 262 of dataset.subsref - "b = a.data{varIndex};". It's very strange to me since it looks like a simple cell dereference. I'm wondering if there is an OOP trick that will allow me to index into "a.data" without the strange overhead.

EDIT2: As per Andrew's suggestion, I've submitted this as a bug to MathWorks. Will update if I hear anything from them.

EDIT3: MathWorks responded and said they are aware of the problem now and will fix it in a future release. They noted that the problem is specific to cell arrays and suggested avoiding them where possible.

Andrey Rubshtein
Rich C
  • What does it look like under the MATLAB profiler? – Richie Cotton Jul 14 '11 at 18:54
  • +1 Richie's comment is the best answer to any Matlab performance question. – Andrew Janke Jul 14 '11 at 19:13
  • Over 90% of the time is spent on line 262 of dataset.subsref, which is strange b/c it is a simple cell dereference. Unfortunately, the example I gave is too simple to show this. I will update with a more realistic example. – Rich C Jul 14 '11 at 19:25
  • @Rich C: "b = a.data{varIndex}" is not a simple cell dereference. The "." part is an MCOS object field access, which is then followed by a "{}" cell dereference. I hacked dataset.subsref to break it in to two lines: "tmp = a.data; b = tmp{varIndex};" and re-profiled it. All the time is spent on the "a.data" field access. Again, MCOS overhead. – Andrew Janke Jul 14 '11 at 22:22
  • Wait, only 1e2 passes in your revised code? That is astonishingly slow. That's 20 milliseconds, not microseconds, for that "a.data" field access inside subsref. Something fishy is going on here besides the normal overhead. It looks like that time is data related. E.g. if I change it to "dat = 42; dat2 = {'foo'};", then it gets fast. It's almost like the copy-on-write optimization isn't working and it's doing a big copy of the data. – Andrew Janke Jul 14 '11 at 22:47
  • I played with this some more and made no headway. I'm getting slow timings in R2011a like you are. This sounds worthy of a bug report to MathWorks. – Andrew Janke Jul 19 '11 at 17:10
  • Thanks for trying Andrew. I will submit this as a bug to Matlab. – Rich C Jul 20 '11 at 01:10
  • No problem Rich. I think you should un-accept my answer, though: it only addresses the original small data set example; it doesn't satisfactorily address the super-slow timings you're seeing in the revised question, which are more to the point. Maybe you'll get to post a real answer if and when you hear back from MathWorks. – Andrew Janke Jul 20 '11 at 04:21
  • @Andrew: You offered up 2 very helpful suggestions: submit a bug report and store dataset columns you access often to another variable. It appears the problem has no solution, so yours is the best possible answer. – Rich C Jul 21 '11 at 14:23
  • Hi Rich, did you get a link to a bug tracker item from MathWorks? And do you know if the bug is specific to @dataset or if it affects other classes that contain cell arrays in their fields? – Andrew Janke Aug 02 '11 at 21:39
  • The rep didn't give me a bug tracker. He only said "the developers are aware of the problem and will fix it in a future release". I'm not sure if it applies to other objects. He didn't say if it was a more general problem. – Rich C Aug 03 '11 at 00:44

1 Answer

Yes, you are most likely seeing the overhead of MATLAB OOP method calls. They are expensive compared to cell indexing, or to method calls in some other languages. Your 0.513872 seconds / 1e4 ~= 51 microseconds per call, which is roughly the cost of a few MCOS method calls; they're ~5-15 microseconds each on the machines I've seen. So that looks like the method overhead of the subsref() call itself, plus the other methods and property accesses it calls in turn.

For some details and discussion, see: Is MATLAB OOP slow or am I doing something wrong?

I don't know of a way to make this faster, aside from structuring your code to minimize calls to "ds.dat" or other methods. If possible, when working with the dataset, call "ds.dat" once, keep the result in a local variable and work with it there, and then push it back into the ds object.
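As a rough sketch of that workaround (assuming the Statistics Toolbox dataset class and the variable names from the question; the loop body is only a placeholder for real work), it could look something like this:

```matlab
% Sketch only: hoist the dataset column access out of the loop.
% Assumes the Statistics Toolbox "dataset" class, as in the question.
dat  = (1:1e4)';
dat2 = repmat({'abc'}, 1e4, 1);
ds   = dataset(dat, dat2);

col = ds.dat2;               % one dataset.subsref call, paid once

n = 0;
for j = 1:numel(col)
    n = n + numel(col{j});   % plain cell indexing, no method dispatch
end

col{1} = 'xyz';              % modify the local copy
ds.dat2 = col;               % one dataset.subsasgn call to push it back
```

This pays the subsref/subsasgn cost once each instead of once per iteration. It does not remove whatever is making the single "ds.dat2" access slow in the first place.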

Caveat: I don't know what "feature accel" does or how it could affect these timings.

Edit: I threw it in the profiler like Richie suggested. On my R2009b, looks like about half the time is method call overhead, and the rest in find(), strcmp(), and other operations inside subsref; subsref doesn't call any other methods in turn.

Edit 2: The revised example is showing much higher timings. Method call overhead doesn't account for all that.

Andrew Janke
  • Yes, saving ds.dat is a good trick to save time. I always do it if I know I'm going to need the same column in a loop. accel off turns off JIT so it doesn't skew the timings for the cell array loop. – Rich C Jul 14 '11 at 19:41