3

THIS QUESTION IS NOT ONLY FOR MATLAB USERS - If you know an answer to the problem in PSEUDOCODE, then feel free to leave your answer as well!


I have two tables Ta and Tb that have different number of rows and different number of columns. The content is all cell text, but maybe in the future it could also contain cell number.

I want to merge the content of these tables together under the following set of rules:

  • Take the value of Ta(i,j) if Tb(i*,j*) is empty and vice versa.
  • If both are available, then take the value of Ta(i,j) (and optionally, check whether they are the same).

The tricky part however is that we do not have unique row keys, we only have unique column keys. Note above that I make a distinction between i* and i. The reason is that the row in Ta can be at a different index than Tb, same holds for the columns j* and j. The implications are:

  • that we first need to identify which row for Ta corresponds to the row of Tb and vice versa. We can do this by trying to crossmatch any of the columns that the tables share in common. However, we might not find a match (in that case we do not merge a row with another one).

Question

How can we merge the content of these two tables together in the most efficient way?


Here are some resources to explain the question in more detail:

1. Matlab example to play with:

Ta = cell2table({...
     'a1', 'b1', 'c1'; ...
     'a2', 'b2', 'c2'}, ...
      'VariableNames', {'A','B', 'C'})
Tb = cell2table({...
     'b2*', 'c2', 'd2'; ...
     'b3', 'c3', 'd3'; ...
     'b4', 'c4', 'd4'}, ...
      'VariableNames', {'B','C', 'D'})

The resulting table Tc should be something like this:

Tc = cell2table({...
    'a1' 'b1' 'c1'   ''; ...
    'a2' 'b2' 'c2' 'd2'; ...
    ''   'b3' 'c3' 'd3'; ...
    ''   'b4' 'c4' 'd4'}, ...
     'VariableNames', {'A', 'B','C', 'D'})

2. A possible first step

I tried the following:

Tc = outerjoin(Ta, Tb, 'MergeKeys', true)

Which works smooth, but the problem is that it lacks the stacking of rows that seem similar. E.g. the above command produces:

 A        B       C       D  
____    _____    ____    ____
''      'b2*'    'c2'    'd2'
''      'b3'     'c3'    'd3'
''      'b4'     'c4'    'd4'
'a1'    'b1'     'c1'    ''  
'a2'    'b2'     'c2'    '' 

Here the rows

''      'b2*'    'c2'    'd2'
'a2'    'b2'     'c2'    '' 

Should have been merged into one:

'a2'    'b2'     'c2'    'd2' 

So we need one more step to stack those two together?


3. Example of a hurdle

If we have something like:

Ta = 
     A        B       C       
    ____    _____    ____
    'a1'    'b1'     'c1' 
    'a2'    'b2'     'c2'

Tb = 
     A        B       C       
    ____    _____    ____
    'a1'    'b2'     'c3' 

then the question arises whether the row in b should be merged with row 1 or row 2 of a or should all rows be merged or just put as a separate row? An idea on how to handle these type of situations would be nice as well.

WJA
  • 6,676
  • 16
  • 85
  • 152
  • 2
    This is pretty similar to your [previous question](https://stackoverflow.com/questions/46682751/efficient-ways-to-append-new-data-in-matlab-with-example-code) right? – Wolfie Oct 17 '17 at 13:42
  • Not really, because I am really aiming for how to use Matlab Table to join together two tables. It is not the same as with the previous question where I distinguish between rows and columns and where I am dealing with numeric data - if you can show me the link with the previous question that would be great. – WJA Oct 17 '17 at 13:47
  • Note also that in this question, there are no unique rows. Its just that the number of rows is different. – WJA Oct 17 '17 at 13:51
  • Actually what I am trying to do is to fill the table Tc with as much as values as possible. I have two Referential tables and sometimes there are some cells missing in A, but they are available in B, or vice versa. So I am trying to use all the information from A or B and put them together. It might indeed happen that A and B both have values but that they are different. In that case, ideally I would like an error warning (but thats optional), and to take over the value from A. – WJA Oct 17 '17 at 14:02
  • See edit. b2 and b2* are different on row 2, where a2 is the same value in both tables. – WJA Oct 17 '17 at 14:07
  • I am adding asterisk to show that b2 is different from b2* - see edit. I am not getting why this question is being downvoted. It is a honest question and struggle, what is wrong with it? – WJA Oct 17 '17 at 14:10
  • Because I am using then the value from table A when there is a mismatch between A and B (in case when both values present and they are different) – WJA Oct 17 '17 at 14:12
  • Thats where I am asking for help. Its not always 1,2,3. The number of rows is different each time. The tables A and B are referential data, including security information. For instance, Table A might have a Bloomberg Ticker, Table B might only have the SecCode for a certain asset. Or sometimes an asset is in A but not in B. – WJA Oct 17 '17 at 14:14
  • Construct the tables Ta and Tb. Then construct Tc by the syntax above. Then you'll see how it is filled. The values itself are examples. The idea is to merge ALL the information from Ta and Tb together. Each column can be an identifier along a row per table. We have to find common ground between Ta and Tb and construct Tc. – WJA Oct 17 '17 at 14:22
  • Potentially another variable that does match on a row. Eg. in our example above, c2 is the common ground for Ta and Tb. – WJA Oct 17 '17 at 14:26
  • Should not happen, as the rows represent each security. Its more the problem that you 1) find common ground and that 2) some other column says that there is a mismatch. So in that case, it will be a mismatch once common ground has been established for row1. – WJA Oct 17 '17 at 14:35
  • Yes that should be correct – WJA Oct 17 '17 at 14:38
  • I have updated the question to reflect the problem better. Hope this makes it more clear. – WJA Oct 18 '17 at 08:07

2 Answers2

3

Here is a conceptual answer, that could get you on the way:

  1. Define a 'scoring function' that tells you per row of Tb how good it matches a row in Ta.
  2. Fill Tc with Ta
  3. For each row in Ta, determine the best match with Tb. If the match quality is above your criterium, define the best match match to be a succesfull match.
  4. If a succesfull match was found, 'consume' it (use the info from Tb to enrich the corresponding row in Tc where needed)
  5. Keep going till you reach the end of Ta, whatever has not been consumed from Tb can now be 'appended' to Tc.

Room for improvement:

Note on choice of matches

Play around with consuming Ta instead of Tb, or use a more complex heuristic to determine the consumption order (e.g. calculate all 'distances' and optimize the matching based on a cost function).

Note that these improvements are only neccesary if you get a lot of false positives with your matches in the basic solution.

Note on definition of match quality

I would recommend you to start very simple with this, for instance if you have 4 fields, simply count how many fields match, or whether all nonempty fields match.

If you want to go further, consider evaluating how far the values are apart (e.g. mse) or how far the texts are apart (e.g. levensteihn distance).

Dennis Jaheruddin
  • 21,208
  • 8
  • 66
  • 122
  • I really like this. Especially the scoring function is a good idea, which will allow you to play with it such to boost the speed. – WJA Oct 19 '17 at 14:05
3

Here is a function that attempts to do the job. You feed in the two tables, a threshold by which you decide whether you merge two rows, and a logical to state whether you prefer to take values from the first table when merge conflict emerges. I did not prepare for extreme cases but see where it gets you with:

TkeepAll=mergeTables(Tb,Ta,1,true)
TmergeSome=mergeTables(Tb,Ta,0.25,true)
TmergeAll=mergeTables(Tb,Ta,-1,true)

here is the function:

function Tmerged=mergeTables(Ta,Tb,threshold,preferA)
%% parameters
% Ta and Tb are two the two tables to merge
% threshold=0.25; minimal ratio of identical values in rows for merge.
%   example: you have one row in table A with 3 values, but you only have two
%   values for the same columns in data B. if one of the values is identical
%   and one isn't, you have ratio of 1/2 aka 0.5, which passes a threshold of
%   0.25
% preferA=true; which to take when there is merge conflict
%% see how well rows fit to each other
% T1 is the table with fewer rows
if size(Ta,1)<=size(Tb,1)
    T1=Ta;
    T2=Tb;
    prefer1=preferA;
else
    T1=Tb;
    T2=Ta;
    prefer1=~preferA;
end
[commonVar1,commonVar2]=ismember(T1.Properties.VariableNames,...
    T2.Properties.VariableNames);
commonVar1=find(commonVar1);
commonVar2(commonVar2==0)=[];
% fit is a table with the size of N rows T1 by M rows T2, with values
% describing what ratio of identical items between each row in
% table 1 (shorter) and each row in table 2 (longer), among all not-missing
% points
for ii=1:size(T1,1) %rows of T1
    for jj=1:size(T2,1)
        fit(ii,jj)=sum(ismember(T1{ii,commonVar1},T2{jj,commonVar2}))/length(commonVar1);
    end
end
%% pair rows according to fit
% match has two columns, first one has T1 row number and secone one has the
% matching T2 row number
unpaired1=true(size(T1,1),1);
unpaired2=true(size(T2,1),1);
count=0;
match=[];
maxv=max(fit,[],2);
[~,order]=sort(maxv,'descend');
order=order';
for ii=order %1:size(T1,1)
    [maxv,maxi]=max(fit,[],2);
    if maxv(ii)>threshold
        count=count+1;
        match(count,1)=ii;
        match(count,2)=maxi(ii);
        unpaired1(ii)=false;
        unpaired2(match(count,2))=false;
        fit(:,match(count,2))=nan; %exclude paired row from next pairing
    end
end

%% prepare new variables
% first variables common to the two tables
Nrows=sum(unpaired1)+sum(unpaired2)+size(match,1);
namesCommon={};
namesCommon(1:length(commonVar1))={T1.Properties.VariableNames{commonVar1}};
for vari=1:length(commonVar1)
    if isempty(match)
        mergedData={};
    else
        if prefer1
            mergedData=T1{match(:,1),commonVar1(vari)}; %#ok<*NASGU>
        else
            mergedData=T2{match(:,2),commonVar2(vari)};
        end
    end
    data1=T1{unpaired1,commonVar1(vari)};
    data2=T2{unpaired2,commonVar2(vari)};
    eval([namesCommon{vari},'=[data1;mergedData;data2];']);
end
% variables only in 1
uncommonVar1=1:size(T1,2);
uncommonVar1(commonVar1)=[];
names1={};
names1(1:length(uncommonVar1))={T1.Properties.VariableNames{uncommonVar1}};
for vari=1:length(uncommonVar1)
    data1=T1{:,uncommonVar1(vari)};
    tmp=repmat({''},Nrows-size(data1,1),1);
    eval([names1{vari},'=[data1;tmp];']);
end
% variables only in 2
uncommonVar2=1:size(T2,2);
uncommonVar2(commonVar2)=[];
names2={};
names2(1:length(uncommonVar2))={T2.Properties.VariableNames{uncommonVar2}};
for vari=1:length(uncommonVar2)
    data2=T2{:,uncommonVar2(vari)};
    tmp=repmat({''},Nrows-size(data2,1),1);
    eval([names2{vari},'=[tmp;data2];']);
end
%% collect variables to a table
names=sort([namesCommon,names1,names2]);
str='table(';
for vari=1:length(names)
    str=[str,names{vari},','];
end
str=[str(1:end-1),');'];
Tmerged=eval(str);
Yuval Harpaz
  • 1,416
  • 1
  • 12
  • 16