1

As a newbie in MATLAB, I find myself always going for the C++-like looping instead of exploiting the matrix-based operations. I encounter another problem today.

Suppose I have two two-column tables (cell arrays), first column being user ID and second column being the corresponding values. However, they are of different sizes in the row. i.e.

Table 1

1 2.56
2 7.4
3 7.7
...
100 83.4

Table 2

1 7.1
3 1.4
4 4.4
...
76 7.2

Despite the differen row sizes, the two tables actually share some common ID. Now I wish to form a new cell array of size Nx3, where N is the number of the common IDs and 1st and 2nd columns are the values from table 1 and table 2, respectively. i.e.

New Table

1 2.56 7.1
3 7.7 1.4
...

Again, I can use a loop to do it, but I really wish to learn a MATLAB-ish way of doing this.

Sibbs Gambling
  • 19,274
  • 42
  • 103
  • 174
  • Were you happy with any of the answers. Personally I believe `ismember` is the most "MATLAB-ish" solution. ;) And I showed it to be faster with sorted data. – chappjc Mar 06 '14 at 19:00
  • @chappjc Yes, I agree. That is awesome! Thanks a lot for the detailed analysis! :D – Sibbs Gambling Mar 10 '14 at 21:17

4 Answers4

2

I personally think using loop is fine. How about using intersect?

A = [1 2.56;2 7.4;3 7.7];
B = [1 7.1; 3 1.4;4 4.4];

[C, ia, ib] = intersect(A(:,1),B(:,1));

D = [A(ia,:), B(ib,2:end)]
ysakamoto
  • 2,512
  • 1
  • 16
  • 22
2

Have a look at ismember, which is very efficient with sorted inputs, as the indexes are in this case:

A = [1 2.56;
     2  7.4;
     3  7.7];
B = [1  7.1;
     3  1.4;
     4  4.4];

[tf,locb] = ismember(A(:,1),B(:,1))

You only need the two outputs this way: tf is a map of rows in A that also exist in B, locb is the location in B for each element in A (i.e. same length as A, zero where there is no match).

As such, a common idiom is to index locb with tf:

>> C = [A(tf,:) B(locb(tf),2:end)]
C =
    1.0000    2.5600    7.1000
    3.0000    7.7000    1.4000

Consider the speed difference between ismember and intersect when the inputs are sorted. Small-ish data:

N = 1e5;  A = [(1:N).' rand(N,1)];  B = [(1:N).' rand(N,1)];


>> tic; [tf,locb] = ismember(A(:,1),B(:,1)); toc
Elapsed time is 0.013419 seconds.
>> tic; [C, ia, ib] = intersect(A(:,1),B(:,1)); toc
Elapsed time is 0.050618 seconds.

Although ismember is several times faster, there is no big advantage for small data. However, with large sorted data sets, use ismember:

N = 1e7;  A = [(1:N).' rand(N,1)];  B = [(1:N).' rand(N,1)];

>> tic; [tf,locb] = ismember(A(:,1),B(:,1)); toc
Elapsed time is 0.892977 seconds.
>> tic; [C, ia, ib] = intersect(A(:,1),B(:,1)); toc
Elapsed time is 5.925537 seconds.

NOTE: If you want to really take advantage of a priori knowledge of sorted inputs, there is an undocumented function called ismembc that skips a call to issorted, making it even faster than ismember. See also here.

Community
  • 1
  • 1
chappjc
  • 30,359
  • 6
  • 75
  • 132
1

This can be solved neatly with the very versatile bsxfun function:

C1 = {1 2.56; 2 7.4; 3 7.7; 100 83.4};
C2 = {1 7.1; 3 1.4; 4 4.4; 76 7.2}; %// example data. Two-column cell arrays

comp = bsxfun(@eq, [C1{:,1}], [C2{:,1}].'); %'// test all pairs for equality
ind1 = any(comp,1); %// values of first col of C1 that coincide with some in C2
ind2 = any(comp,2); %// values of first col of C2 that coincide with some in C1
result = horzcat(C1(ind1,1), C1(ind1,2), C2(ind2,2)); %// build result

Note that

  • [C1{:,1}] is used to turn the first column of the cell array C1 into a (numeric) row vector. Here's why this works.
  • The second argument to any specifies which dimension it operates along.
  • ind1 and ind2 are logical indices.

Also, numeric arrays (matrices) in Matlab are more efficient than cell arrays. If all the cells in your cell array contain a single number, as in your case, consider using a numeric array instead. Cell arrays are useful when each cell must contain objects of different size or of different type (for example if one column contains numbers and another column contains strings).

If you use numeric arrays, the code also becomes slightly simpler in this case:

C1 = [1 2.56; 2 7.4; 3 7.7; 100 83.4];
C2 = [1 7.1; 3 1.4; 4 4.4; 76 7.2]; %// example data. Two-column matrices

comp = bsxfun(@eq, C1(:,1).', C2(:,1)); %'// test all pairs for equality
ind1 = any(comp,1); %// values of first col of C1 that coincide with some in C2
ind2 = any(comp,2); %// values of first col of C2 that coincide with some in C1
result = [C1(ind1,1) C1(ind1,2) C2(ind2,2)]; %// build result
Luis Mendo
  • 110,752
  • 13
  • 76
  • 147
0

Traditionally, I think @ysakamto's answer is best. However, if you're using a newer version of Matlab (last year?), they added a table datatype which supports SQL-type operations such as join. http://www.mathworks.com/help/matlab/ref/join.html

schodge
  • 891
  • 2
  • 16
  • 29
  • `ismember` is very efficient with sorted inputs, which they are in this case. – chappjc Feb 28 '14 at 21:36
  • That's very possible, only @yskamto's answer was posted when I started writing mine. Interesting the number of useful solutions to this problem. Perhaps the OP will implement them all on his data and time them for us? – schodge Feb 28 '14 at 22:11
  • I added a performance comparison to my answer. `ismember` is several times faster, and the difference becomes large as the data gets big. – chappjc Mar 01 '14 at 01:23
  • Interesting, thanks. I actually meant the original asker, didn't mean to suggest more work for you, but seeing performance data on alternate solutions is always extremely helpful. Thanks for posting it. – schodge Mar 01 '14 at 02:32
  • I know. I'm here for the fun after all, not the good pay! ;) – chappjc Mar 01 '14 at 02:33