Merging multiple dataframes based on rows creates extra columns

Question

I have a MasterList dataframe that I merge other data sets into via a loop. Each time I merge a new column is created as with _x or _y. How can I just keep these as one column?

import pandas as pd
MasterList = pd.DataFrame(data = [['0001'],['0002'], ['0003'], ['0004']], columns = ['Order Number'])

customer_file1 = pd.DataFrame(data = [['0003', 'M'], ['0004', 'W']], columns = ['Order Number', 'Day'])
customer_file2 = pd.DataFrame(data = [['0001', 'T'], ['0002', 'S']], columns = ['Order Number', 'Day'])


for x in [customer_file1, customer_file2]:
    MasterList = pd.merge(MasterList, x, how='left',left_on= 'Order Number',right_on= 'Order Number')


print MasterList

Output:

  Order Number Day_x Day_y
0         0001   NaN     T
1         0002   NaN     S
2         0003     M   NaN
3         0004     W   NaN

Desired Output:

  Order Number   Day
0         0001     T
1         0002     S
2         0003     M
3         0004     W

Edit: People wanted more data as I over simplified my example: I know year and day don't really make sense in the data set buy that is ok. Each customer file really come from a query of a different database, so I'd like to make the query from the database then merge in the data and forget about it rather than querying all the customer databases, concatenating, then merging.

import pandas as pd
MasterList = pd.DataFrame(data = [['0001', '2015'],['0002', '2015'], ['0003', '2016'], ['0004', '2015'], ['0005', '2017'], ['0006', '2018']], columns = ['Order Number', 'Year'])

customer_file1 = pd.DataFrame(data = [['0003', 'M'], ['0004', 'W']], columns = ['Order Number', 'Day'])
customer_file2 = pd.DataFrame(data = [['0001', 'T'], ['0002', 'S']], columns = ['Order Number', 'Day'])
customer_file3 = pd.DataFrame(data = [['0005', 'T'], ['0006', 'S']], columns = ['Order Number', 'Day'])

for x in [customer_file1, customer_file2, customer_file3]:
    MasterList = pd.merge(MasterList, x, how='left', left_on='Order Number', right_on='Order Number')


print MasterList

output:

  Order Number  Year Day_x Day_y  Day
0         0001  2015   NaN     T  NaN
1         0002  2015   NaN     S  NaN
2         0003  2016     M   NaN  NaN
3         0004  2015     W   NaN  NaN
4         0005  2017   NaN   NaN    T
5         0006  2018   NaN   NaN    S

Desired output:

  Order Number  Year    Day 
0         0001  2015     T
1         0002  2015     S
2         0003  2016     M
3         0004  2015     W
4         0005  2017     T
5         0006  2018     S

Zero · Accepted Answer · 2017-09-29T18:05:02.687

2

Option 1] Use map and combine_first

In [5044]: MasterList['Day'] = np.nan
      ...: for x in [customer_file1, customer_file2]:
      ...:     day = MasterList['Order Number'].map(x.set_index('Order Number')['Day'])
      ...:     MasterList['Day'] = MasterList['Day'].combine_first(day)
      ...:

In [5045]: MasterList
Out[5045]:
  Order Number Day
0         0001   T
1         0002   S
2         0003   M
3         0004   W

Options 2]

Use merge and append

In [5032]: MasterList.merge(customer_file1.append(customer_file2))
Out[5032]:
  Order Number Day
0         0001   T
1         0002   S
2         0003   M
3         0004   W

Or use merge and concat

In [5033]: MasterList.merge(pd.concat([customer_file1, customer_file2]))
Out[5033]:
  Order Number Day
0         0001   T
1         0002   S
2         0003   M
3         0004   W

edited Sep 29 '17 at 18:05

answered Sep 29 '17 at 17:56

Zero

74,117
18
147
154

This was a simplified example. `customer_file1` and `customer_file2` are queried from a database and there are many of them. I'd like to merge in and then drop the data to save memory. – user2242044 Sep 29 '17 at 17:58
What happens if a `order number` is present in both `customer_file1` and `customer_file2` which one get's precedence? – Zero Sep 29 '17 at 18:02
OP, simply delete after the append/merge: `del customer_file1` (or in loop). Ideally, you import directly into a list and then delete the list (no individual objects). – Parfait Sep 29 '17 at 18:04
Updated with `map` and `combine_first` approach, that should do. – Zero Sep 29 '17 at 18:05

Bharath M Shetty · Answer 2 · 2017-09-29T18:04:58.093

2

Common mistake of merge use concat instead i.e

MasterList = pd.concat([customer_file2,customer_file1],ignore_index=True)

  Order Number Day
0         0001   T
1         0002   S
2         0003   M
3         0004   W

edited Sep 29 '17 at 18:04

answered Sep 29 '17 at 17:58

Bharath M Shetty

30,075
6
57
108

This was a simplified example. customer_file1 and customer_file2 are queried from a database and there are many of them. I'd like to merge in and then drop the data to save memory. Additionally MasterList has some other data I need. – user2242044 Sep 29 '17 at 17:58
Can you put more data so we get some clarification – Bharath M Shetty Sep 29 '17 at 18:00

score 0 · Answer 3 · answered Sep 29 '17 at 18:31

0

Base on your out put , just do following , you get your Desired output.

df.apply(lambda x: sorted(x, key=pd.isnull), 1).dropna(1)
Out[126]: 
   Order  Number  Year Day_x
0      0       1  2015     T
1      1       2  2015     S
2      2       3  2016     M
3      3       4  2015     W
4      4       5  2017     T
5      5       6  2018     S

answered Sep 29 '17 at 18:31

BENY

317,841
20
164
234

@Bharathshetty aha :) – BENY Sep 29 '17 at 18:49
@Bharathshetty also ,I asking the same question before https://stackoverflow.com/questions/45970751/shift-nans-to-the-bottom-of-their-respective-columns – BENY Sep 29 '17 at 18:52
1

I have used that as an answer haha https://stackoverflow.com/questions/46304830/condensing-pandas-dataframe-by-dropping-missing-elements/46305134#46305134 – Bharath M Shetty Sep 29 '17 at 18:54

Merging multiple dataframes based on rows creates extra columns

3 Answers3