THIS QUESTION IS NOT ONLY FOR MATLAB USERS - If you know an answer to the problem in PSEUDOCODE, then feel free to leave your answer as well!
I have two tables Ta and Tb that have different number of rows and different number of columns. The content is all cell text, but maybe in the future it could also contain cell number.
I want to merge the content of these tables together under the following set of rules:
- Take the value of
Ta(i,j)
ifTb(i*,j*)
is empty and vice versa. - If both are available, then take the value of
Ta(i,j)
(and optionally, check whether they are the same).
The tricky part however is that we do not have unique row keys, we only have unique column keys. Note above that I make a distinction between i*
and i
. The reason is that the row in Ta can be at a different index than Tb, same holds for the columns j*
and j
. The implications are:
- that we first need to identify which row for Ta corresponds to the row of Tb and vice versa. We can do this by trying to crossmatch any of the columns that the tables share in common. However, we might not find a match (in that case we do not merge a row with another one).
Question
How can we merge the content of these two tables together in the most efficient way?
Here are some resources to explain the question in more detail:
1. Matlab example to play with:
Ta = cell2table({...
'a1', 'b1', 'c1'; ...
'a2', 'b2', 'c2'}, ...
'VariableNames', {'A','B', 'C'})
Tb = cell2table({...
'b2*', 'c2', 'd2'; ...
'b3', 'c3', 'd3'; ...
'b4', 'c4', 'd4'}, ...
'VariableNames', {'B','C', 'D'})
The resulting table Tc should be something like this:
Tc = cell2table({...
'a1' 'b1' 'c1' ''; ...
'a2' 'b2' 'c2' 'd2'; ...
'' 'b3' 'c3' 'd3'; ...
'' 'b4' 'c4' 'd4'}, ...
'VariableNames', {'A', 'B','C', 'D'})
2. A possible first step
I tried the following:
Tc = outerjoin(Ta, Tb, 'MergeKeys', true)
Which works smooth, but the problem is that it lacks the stacking of rows that seem similar. E.g. the above command produces:
A B C D
____ _____ ____ ____
'' 'b2*' 'c2' 'd2'
'' 'b3' 'c3' 'd3'
'' 'b4' 'c4' 'd4'
'a1' 'b1' 'c1' ''
'a2' 'b2' 'c2' ''
Here the rows
'' 'b2*' 'c2' 'd2'
'a2' 'b2' 'c2' ''
Should have been merged into one:
'a2' 'b2' 'c2' 'd2'
So we need one more step to stack those two together?
3. Example of a hurdle
If we have something like:
Ta =
A B C
____ _____ ____
'a1' 'b1' 'c1'
'a2' 'b2' 'c2'
Tb =
A B C
____ _____ ____
'a1' 'b2' 'c3'
then the question arises whether the row in b should be merged with row 1 or row 2 of a or should all rows be merged or just put as a separate row? An idea on how to handle these type of situations would be nice as well.