
I read the following table from MySQL into a MATLAB cell array:

Nyse = fetch(conn,'SELECT ticker,date,utcsec,bid,ofr FROM HFE.Quotes where ex="N" order by utcsec,bid;');

The Nyse cell array contains 1,000,000 rows. I want to calculate the median bid for each second, where the second is recorded as a string in the utcsec column. I do it in the following way:

utcsec=cell2mat(Nyse(:,3));
bid=cell2mat(Nyse(:,4));
NyseBid=grpstats(bid,utcsec,{'median'});

The problem is that grpstats takes around 70 seconds to do this. How can I optimize the code so that it runs faster?

An example string in the UTCSEC column is '09:30:00'.
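
To make the grouping concrete, here is a tiny illustration using made-up values (the *_demo names exist only for this example):

utcsec_demo = {'09:30:00'; '09:30:00'; '09:30:01'};   %timestamps as strings
bid_demo    = [2.3; 2.4; 4.5];                        %bids observed in those rows
grpstats(bid_demo, utcsec_demo, {'median'})           %medians per second: 2.35 and 4.5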

  • One million rows is not a trivial amount of data. Do you only need the median or do you need the other information calculated by `grpstats`? – sco1 Nov 18 '15 at 18:35
  • I'm not sure I quite follow the data. What are the dimensions and data types of `utcsec` and `bid`? Is `utcsec` just a time vector? – sco1 Nov 18 '15 at 18:41
  • utcsec is a vector containing strings with the following format: '09:30:00', '09:30:00', '09:30:01', and so on. – Søren J Nov 18 '15 at 18:45
  • What about `bid`? Is `utcsec` [monotonic](https://en.wikipedia.org/wiki/Monotonic_function)? – sco1 Nov 18 '15 at 18:49
  • bid is a vector of numbers: 2.3, 2.4, 4.5, 4.6, and so on. utcsec is monotonic, I guess; it is increasing. – Søren J Nov 18 '15 at 18:52
  • If `bid` and `utcsec` are both vectors what exactly do you want to group for each timestamp? There's only one measurement... – sco1 Nov 18 '15 at 18:54
  • What SQL server are you using? This might possibly done entirely within SQL. – Matthew Gunn Nov 18 '15 at 23:57
  • There is no "median" function in MySQL. (But it can be done in some clumsy code.) If the `SELECT` is the slow part, then add `INDEX(ex, utcsec, bid)`. – Rick James Nov 28 '15 at 03:17

1 Answer


I suggest you check out this question and its answers, as it is a highly related problem.

To apply the results of that thread to this problem, I would use this MEX function I wrote, which takes an array of group IDs and returns which rows belong to each group. This allows efficient aggregation by group.

As I understand it, utcsec is essentially a group ID and bid is the array to aggregate. The code would go:

utcsec = Nyse(:,3);            %utcsec here should be a cell array of strings
bid    = cell2mat(Nyse(:,4));  %numeric bid vector, as in the question
[unique_utcsec, map] = mg_getRowsWithKey(utcsec);  %call to my magic function
         %unique_utcsec contains unique strings in utcsec
         %map shows us which rows correspond to each unique second

median_bid = zeros(length(unique_utcsec), size(bid,2));

for i = 1:length(unique_utcsec)  %iterate over each utc second
    median_bid(i,:) = median(bid(map{i},:),1);  %calculate the median for that second
end
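
If you want to see each second next to its median, one possible follow-up (assuming unique_utcsec comes back as a cell array of strings, as described above) is:

result = [unique_utcsec(:), num2cell(median_bid)];   %pair each unique second with its median bid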

In my testing, this code is significantly faster than MATLAB's grpstats. There are alternative approaches in that thread as well that do not rely on MEX (a pure-MATLAB sketch is given below). The MEX C++ code should be compiled with:

mex -largeArrayDims mg_getRowsWithKey.cpp

Then mg_getRowsWithKey can be called like any other MATLAB function. It is written in C++ using STL containers such as std::map.
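
For completeness, here is one pure-MATLAB sketch of the no-MEX route referred to above (not necessarily identical to the approaches in that thread), assuming utcsec is a cell array of time strings and bid is a numeric column vector:

[unique_utcsec, ~, groupIdx] = unique(utcsec);        %groupIdx maps each row to its unique second
median_bid = accumarray(groupIdx, bid, [], @median);  %median bid per unique second

I have not benchmarked this against the MEX version on a million rows, so treat it as a sketch rather than a drop-in replacement.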
