2

So, I have two Excel tables that I am looking to merge, on my MacBook. They're quite large. One has dimensions (42912, 6) and the other has dimensions (41424,7). Therefore, here are a couple of examples of what they look like:

Table A

| Country       | Continent         |
|-------------  |---------------    |
| UK            | Europe            |
| France        | Europe            |
| Germany       | Europe            |
| USA           | North America     |
| New Zealand   | Oceania           |

Table B

| Country       | City      | Population    |
|-------------  |---------- |------------   |
| UK            | London    | 8,900,000     |
|               | Bristol   | 53,907        |
| France        | Paris     | 2,141,000     |
| USA           | New York  | 8,623,000     |
| New Zealand   | Auckland  | 1,657,000     |

You can see some of the differences between the two. For example, whilst Table A lists 5 countries, Table B is missing one of them - Germany. In addition. Table B has 2 cities listed under UK, where Table A only has one row.

Basically, I want to be able to merge the two tables, so they look like this:

| Country       | Continent         | City      | Population    |
|-------------  |---------------    |---------- |------------   |
| UK            | Europe            | London    | 8,900,000     |
|               |                   | Bristol   | 53,907        |
| France        | Europe            | Paris     | 2,141,000     |
| Germany       | Europe            |           |               |
| USA           | North America     | New York  | 8,623,000     |
| New Zealand   | Oceania           | Auckland  | 1,657,000     |

The two tables are being merged by the Country string/column which they both have in common, effectively adding the remaining relevant rows from Table B, to Table A.

When Table A has a row that Table B doesn't, the columns are just left empty (Germany row as an example). When Table B has multiple rows in the same country, additional rows are created in Table A (UK as an example) to accommodate.

I would really appreciate some help in understanding how I can merge these two (given they have over 40k rows each) rather than doing it line by line in Excel, which would take ages.

I'm quite happy to work with Python or Matlab etc., or even simpler - a way to do it in Excel for Mac.

Thanks!

gehbiszumeis
  • 3,525
  • 4
  • 24
  • 41
James
  • 669
  • 1
  • 10
  • 21
  • 2
    You're basically trying to apply a `full outer join` between two table. In matlab you can use the function `outerjoin(TableA,TableB)`.Check how it can be done using [excel](https://superuser.com/questions/1023123/how-to-simulate-a-full-outer-join-in-excel), [python](https://stackoverflow.com/questions/47504975/joining-two-dataframes-in-pandas-using-full-outer-join), or [matlab](https://mathworks.com/help/matlab/ref/outerjoin.html) – obchardon Jan 16 '20 at 13:43
  • Yes that seems like it, thanks! I'll give it a go soon, to see if it works. I had been trying it with the `join` function although I kept running into many errors, including because of the different numbers of rows – James Jan 16 '20 at 13:48
  • I am seconding the comment of @obchardon. However, a simple `outerjoin` won't work with missing values in the column you want to merge on, i.e. `Country`. Please see my answer of how to deal with that in python and MATLAB. – gehbiszumeis Jan 17 '20 at 08:28

1 Answers1

2

A problem in your data are missing Country entries for lines like the one containing Bristol. Since you want to merge on the Country key, merge routines won't know how to merge those lines in the form your tables currently have. Doing some data wrangling before will help, as described below.

Python solution:

Use pandas to read both excel files and use pd.merge() for the merge. pd.DataFrame.ffill() will solve the "Bristol" problem described above.

import pandas as pd


def xls2df_with_ffill(xls_path, fill_key):
    """
    reads xls to dataframe and fills empty `Country` entries with the values
    of the previous row.
    """
    df = pd.read_excel(xls_path)
    df.loc[:, fill_key] = df.loc[:, fill_key].ffill()

    return df


df1 = xls2df_with_ffill('stackoverflow1.xlsx', 'Country')
df2 = xls2df_with_ffill('stackoverflow2.xlsx', 'Country')

# do the merge and write to new excel file
merged_df = pd.merge(df1, df2, how='outer', on='Country')
merged_df.to_excel('stackoverflow_merged.xlsx')

MATLAB solution:

MATLAB > version 2013b provides the table data type, which is as convenient for join-purposes like yours as a pd.DataFrame is in Python. To solve the missing Country value problem, MATLAB provides function fillmissing. Merge the tables with outerjoin.

table1 = xls2table_with_fillmissing('stackoverflow1.xlsx', 'Country');
table2 = xls2table_with_fillmissing('stackoverflow2.xlsx', 'Country');

% do the merge and write to new excel file
merged_table = outerjoin(table1, table2, 'Type', 'Left', ...
    'MergeKeys', true);
writetable(merged_table, 'stackoverflow_merged.xlsx')


function table = xls2table_with_fillmissing(xls_path, fill_key)
% reads xls to table and fills empty `Country` entries with the values of
% the previous row. 
    table = readtable(xls_path);
    table(:, fill_key) = fillmissing(table(:, fill_key), 'previous');
end

In both cases functions the result is an excel table looking like this:

enter image description here

As you can see, due to application of ffill() / fillmissing, there is an UK filled to the Country column in the Bristol row. Besides of that, it is matching your expectations.

gehbiszumeis
  • 3,525
  • 4
  • 24
  • 41
  • Thanks so much for the detailed reply. I really appreciate it. That certainly ticks the boxes for me! It seems a lot simpler than I first thought. I was just missing little bits, such as the fillmissing function. – James Jan 18 '20 at 09:10