Missing columns when merging multiple dataframes based on column value

Question

I have five dataframes each has unique columns and one common column id no.. Say each dataframe has the following columns:

df1: id no, time, date, age, name.
df2: id no, gender, address, employment, birth_date.
df3: id no, .....etc.
df4: id no, ......etc.
df5: id no, .......etc.

I have used merge as the following:

first1 = pd.merge(df1, df2, how= 'left', on = 'id_no')
first2 = pd.merge(first1, df3, how= 'left', on = 'id_no')
first3 = pd.merge(first2, df4, how= 'left', on = 'id_no')
combineall = pd.merge(first3, df5, how= 'left', on = 'id_no')

The problem: The columns of df3 are missing from combineall dataframe. Though when I print df3 alone I see all the contents available. How my df3 gets missing during merging? Is there I make this process easier and less problematic.

Goal: I want to have comabinall dataframe with the all the columns from df1,df2,df3,df4,df5 merged based on id_no.

score 0 · Answer 1 · answered Feb 21 '18 at 06:43

0

Try out this:

from functools import reduce
finaldf = reduce(lambda left,right: pd.merge(left, right, on='id_no', how='left'), [df1,df2,df3,df4,df5])

answered Feb 21 '18 at 06:43

Aditi

820
11
27

I still face the same issue. My df3 columns are empty in finaldf. – Firis Feb 21 '18 at 06:49
Try removing `how='left'` – Aditi Feb 21 '18 at 06:57

Soudipta Dutta · Answer 2 · 2022-12-28T06:41:46.770

import pandas as pd      
import numpy as np

df1 = pd.DataFrame(np.array([
    ['a', 1, 2],
    ['b', 3, 4],
    ['c', 5, 6]]),
    columns=['id', 'A', 'B'])
df2 = pd.DataFrame(np.array([
    ['a', 10, 12],
    ['b', 13, 14],
    ['c', 15, 16]]),
    columns=['id', 'C', 'D'])
df3 = pd.DataFrame(np.array([
    ['a', 100, 120],
    ['b', 130, 140],
    ['c', 150, 160]]),
    columns=['id', 'E', 'F'])

dfs = [df1, df2, df3]


dfs = [df.set_index('id') for df in dfs]
aa = dfs[0].join(dfs[1:])
aar= aa.reset_index()
print(aar)
"""
Explanation 1:
 1. Set the index of each dataframe to the id column.
 2. Join the dataframes together.
 3. Reset the index to get the id column back.
"""
"""
OUTPUT :
  id  A  B   C   D    E    F
0  a  1  2  10  12  100  120
1  b  3  4  13  14  130  140
2  c  5  6  15  16  150  160
"""

bb = pd.concat(
    objs = ( common_ID.set_index('id') for common_ID in (df1, df2, df3)     ),
    axis=1, 
    join='inner'
).reset_index()
print(bb)

"""
OUTPUT:
  id  A  B   C   D    E    F
0  a  1  2  10  12  100  120
1  b  3  4  13  14  130  140
2  c  5  6  15  16  150  160
  
""" 
"""
Explanation 2:
 
 1. set_index('id') for common_ID in (df1, df2, df3)
    - this is a generator expression that creates a generator object
    - the generator object is a sequence of dataframes with the common_ID as the index
 2. pd.concat(objs = generator_object, axis=1, join='inner')
    - this concatenates the dataframes in the generator object along the columns
    - the join='inner' ensures that only the common_ID is used
 3. .reset_index()
    - this resets the index to the default index

"""

cc = df1.merge(df2,on='id').merge(df3,on='id')
print(cc)

   import functools as ft
dd = ft.reduce(lambda x, y: pd.merge(x, y, on='id'), dfs)
print(dd)

Missing columns when merging multiple dataframes based on column value

2 Answers2