
I am currently trying to combine two different datasets that share an identical column called Ccode, using the following method:

import pandas as pd
data_a = pd.read_csv(r'system.csv', encoding = 'cp949')
data_b = pd.read_csv(r'Seoul.csv', encoding = 'cp949')
pd.merge(data_a, data_b, how = 'left', on = 'Ccode')

Instead of getting a combined table this error message keeps popping up:

MemoryError: Unable to allocate 73.7 GiB for an array with shape (162, 61021050) and data type int64

Should I try a different method or was there something wrong with my code?

EDIT: Here's a sample of the data I'm working with:

data_a = pd.DataFrame({'Ccode': [11260, 11203, 12121, 13101, 11002],
                       'Dname': ['Jonggu', 'Jongnogu', 'Seongbukgu', 'Mapogu', 'Dongdaemungu'],
                       'Xcoor': [205310, 210191, 199768, 200974, 198397],
                       'Ycoor': [445727, 446339, 452273, 451975, 451624]},
                      columns=['Ccode', 'Dname', 'Xcoor', 'Ycoor'])

data_b = pd.DataFrame({'Ccode': [12260, 11133, 11001, 11591, 10000],
                       'Acode': ['11', '11', '11', '11', '11'],
                       'Opostc': [135080, 153010, 143200, 157812, 138735],
                       'Npostc': [6149, 8545, 4992, 7619, 5510]},
                      columns=['Ccode', 'Acode', 'Opostc', 'Npostc'])

There are a total of 33 columns in data_a and 168 columns in data_b. The only column that the two datasets share is 'Ccode'.
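(Editor's note: one common cause of this kind of blow-up is a non-unique join key. A minimal sketch with made-up values shows how duplicate keys multiply rows in a merge:)

```python
import pandas as pd

# If 'Ccode' is not unique in both frames, merge emits one row per
# matching pair, so duplicates multiply: 3 left x 4 right -> 12 rows.
left = pd.DataFrame({'Ccode': [11260] * 3, 'Dname': ['a', 'b', 'c']})
right = pd.DataFrame({'Ccode': [11260] * 4, 'Npostc': [1, 2, 3, 4]})

merged = pd.merge(left, right, how='left', on='Ccode')
print(len(merged))  # 12 rows from only 3 + 4 input rows
```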

andrewJames
    Does this answer your question? [Unable to allocate array with shape and data type](https://stackoverflow.com/questions/57507832/unable-to-allocate-array-with-shape-and-data-type) – Mayank Porwal Apr 07 '20 at 06:59
  • can you please add sample data frames, for better understanding. – Tshiteej Apr 07 '20 at 07:01
  • @Mayank Porwal Well not exactly but hey at least it's not a problem unique to me. I'm still puzzled as to why the resulting file is so big though. – danielcoben Apr 07 '20 at 07:54
  • @Tshiteej Sure thing, I've edited the OP – danielcoben Apr 07 '20 at 07:56
  • The code works fine for me. Can you check [this](https://trainingsupport.microsoft.com/en-us/tcmpd/forum/all/i-am-getting-below-error-memoryerror-unable-to/badac065-c35b-4853-9122-e7607e40ecae) and see if this helps. – Tshiteej Apr 07 '20 at 08:38

1 Answer


Why not:

import pandas as pd
data_a = pd.read_csv(r'system.csv', encoding = 'cp949')
data_b = pd.read_csv(r'Seoul.csv', encoding = 'cp949')
data_a.join(data_b.set_index('Ccode'), on='Ccode')

Although it seems you're having memory problems anyway (i.e. the resulting table is too big to fit in memory).

My guess is that because the indexes aren't specified, merge falls back to the default numeric index when you load the data frames, which might produce the explosion you're seeing. Or the 'Ccode' columns in the two data frames may have mismatched data types (check for NAs if they are both supposed to be int, since NAs can silently convert the column to float or object).
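To check that second possibility, here's a minimal sketch with hypothetical frames showing how a single NA promotes an integer key column to float, and one way to realign the types before joining (dropping the NA rows first):

```python
import pandas as pd

# Hypothetical frames: an NA in one frame silently promotes
# its integer 'Ccode' column to float64.
data_a = pd.DataFrame({'Ccode': [11260, 11203]})
data_b = pd.DataFrame({'Ccode': [11260.0, None]})

print(data_a['Ccode'].dtype)         # int64
print(data_b['Ccode'].dtype)         # float64, because of the NA
print(data_b['Ccode'].isna().sum())  # 1 missing key

# Align the dtypes before joining: drop rows with a missing key,
# then cast back to int64 so both keys compare as the same type.
data_b = data_b.dropna(subset=['Ccode'])
data_b['Ccode'] = data_b['Ccode'].astype('int64')
print(data_b['Ccode'].dtype)         # int64
```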

jadianes
  • Thank you so much for the solution. Is it normal for the resulting data to take up this much space? I mean, the original files were around 500MB and 1.2GB so I was expecting something around 2~3GB for the end result. – danielcoben Apr 07 '20 at 07:43
  • I just fixed a typo (`join` is an instance method). With those dataframe sizes, depending on the index overlap and which one goes on the left, it shouldn't get bigger than that, yes. The problem you're having is likely to be related with the indexes not being specified to be on `Ccode`, or it might even be that in each data frame `Ccode` is a different type. For example if it's an integer in one of the DF but float in the other because of some NA, it could end up not joining properly. So check for NAs too (use `isna()`) – jadianes Apr 07 '20 at 07:50