
I have a very large MATLAB table of about 11 million rows that I would like to rearrange into a cell array for each instance of one of the variables. The following smaller data set illustrates my problem:

Data = table(repmat(randperm(50).',200,1),rand(10000,1),'VariableNames',{'ID','VAR'});

I am able to perform the task by the following command, but it is quite slow when applied to my real table:

UniID = unique(Data.ID);

CellData = arrayfun(@(x)Data(Data.ID==x,:),UniID,'UniformOutput',false);

What can I do to optimize the execution time?

Mace
  • Maybe get `Data.ID` into a separate variable? – Divakar Feb 17 '15 at 09:57
  • Also try using a `for` loop instead of `arrayfun`, there might be less overhead (see the sketch after these comments)... – Dan Feb 17 '15 at 10:07
  • Getting rid of `table` will probably be the best way to speed this up. – knedlsepp Feb 17 '15 at 10:13
  • Putting `Data.ID` into a separate variable did a little, but not a lot. I can also try the for loop. In my real table I have a lot of variables and some of them are not numeric, so I'm not sure how to avoid table. – Mace Feb 17 '15 at 11:53
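
A minimal sketch of the suggestions from the comments above (pull `Data.ID` into a separate variable once and replace `arrayfun` with a plain `for` loop); the variable names are illustrative:

ids = Data.ID;                      % read the column once instead of per iteration
UniID = unique(ids);
CellData = cell(numel(UniID),1);    % preallocate the output
for k = 1:numel(UniID)
    CellData{k} = Data(ids == UniID(k), :);
end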

1 Answer


As you may want to do some processing after this, consider using:

B = varfun(@(x) {x}, Data, 'GroupingVariables', 'ID');

You can either use this to partition the values into groups as presented above, or directly apply some function like mean by changing @(x) {x} to @mean. This should be the clearest solution, yet it won't give you any speed gains.
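
For instance, a sketch of the aggregation variant (the resulting table holds one row per ID):

GroupMeans = varfun(@mean, Data, 'GroupingVariables', 'ID');   % mean of VAR per ID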


You might, however, get a little speed gain if you don't use tables but plain arrays. There, instead of 'GroupingVariables', you would use accumarray.

If your Data.IDs are already positive integers, you don't need any preprocessing step (if they aren't, use [~,~,newID] = unique(Data.ID)) and can just use:

accumarray(Data.ID, Data.VAR, [], @(x) {x})
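
If the IDs do need the preprocessing step, a sketch of the full pattern (uID and grouped are illustrative names):

[uID, ~, newID] = unique(Data.ID);                    % remap IDs to 1..numel(uID)
grouped = accumarray(newID, Data.VAR, [], @(x) {x});  % grouped{k} holds the VARs of uID(k)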

If your table has only two variables, this will be sufficient. If you are dealing with more variables, you will have to use something like:

accumarray(Data.ID, 1:size(Data,1) ,[], @(I) {Data(I,:)})

Both of these will likely shuffle the internal ordering of each cell entry. If you don't want this, use a stable version of accumarray.
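
One common workaround (not necessarily identical to the linked version) is to pass row indices and sort them inside the anonymous function, so each cell keeps the original row order:

CellDataStable = accumarray(Data.ID, (1:size(Data,1)).', [], @(I) {Data(sort(I),:)});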

As the table data structure has some overhead, this will possibly be even faster if you don't use the Data table to access the values, but the arrays themselves:

VAR1 = rand(100000,1);
VAR2 = rand(100000,1);
ID = repmat(randperm(50).',2000,1);
VARsPartitioned = accumarray(ID, 1:numel(ID) ,[], @(I) {[VAR1(I,:), VAR2(I,:)]});

For a million rows and 5000 different IDs, I get these results:

arrayfun:                ~30 seconds
varfun:                  ~30 seconds
accumarray using table:  ~3 seconds
accumarray using arrays: ~0.3 seconds

PS: You can also use functions like @mean or @std directly with accumarray, without needing to group the values in a first step.
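
For example, with the arrays from above (a sketch):

meanPerID = accumarray(ID, VAR1, [], @mean);   % one mean per ID, no grouping step
stdPerID  = accumarray(ID, VAR1, [], @std);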

knedlsepp
  • Thanks! Looks like avoiding `table` and `arrayfun`, and using `accumarray` instead is the way to go... – Mace Feb 17 '15 at 12:06
  • In my real data, the IDs are positive integers, but they don't run consecutively from 1 to 50; they might start at 7100 and not be consecutive from there. I can't see how to adjust your last solution to this? – Mace Feb 17 '15 at 12:26
  • @Mace: If I understand correctly, you don't want there to be any 'holes'; yes you can use `unique` there. If you don't mind the holes, it should work though. – knedlsepp Feb 17 '15 at 12:52