310

I have 3 CSV files. Each has the first column as the (string) names of people, while all the other columns in each dataframe are attributes of that person.

How can I "join" together all three CSV documents to create a single CSV with each row having all the attributes for each unique value of the person's string name?

The join() function in pandas specifies that I need a multiindex, but I'm confused about what a hierarchical indexing scheme has to do with making a join based on a single index.

smci
  • 32,567
  • 20
  • 113
  • 146
lollercoaster
  • 15,969
  • 35
  • 115
  • 173
  • 4
    You don't need a multiindex. It states in the join docs that of you don't have a multiindex when passing multiple columns to join on then it will handle that. – cwharland May 15 '14 at 03:29
  • 2
    In my trials, `df1.join([df2, df3], on=[df2_col1, df3_col1])` didn't work. – lollercoaster May 15 '14 at 06:50
  • 1
    You need to chain them together like in the answer given. Merge df1 and df2 then merge the result with df3 – cwharland May 15 '14 at 12:25
  • Creating an empty data fram and appending columns to it is the simplest solution I found after getting errors from solutions provided below: https://onecompiler.com/python/3zhtqjrut Source: https://www.geeksforgeeks.org/how-to-create-an-empty-dataframe-and-append-rows-columns-to-it-in-pandas/?ref=gcse – Lod Aug 17 '23 at 17:11

12 Answers12

639

Zero's answer is basically a reduce operation. If I had more than a handful of dataframes, I'd put them in a list like this (generated via list comprehensions or loops or whatnot):

dfs = [df0, df1, df2, ..., dfN]

Assuming they have a common column, like name in your example, I'd do the following:

import functools as ft
df_final = ft.reduce(lambda left, right: pd.merge(left, right, on='name'), dfs)

That way, your code should work with whatever number of dataframes you want to merge.

Kit
  • 30,365
  • 39
  • 105
  • 149
  • 20
    I just tried using this and it failed because `reduce` was replaced with [`functools.reduce`](https://stackoverflow.com/questions/8689184/nameerror-name-reduce-is-not-defined-in-python) So `import functools functools.reduce(.......)` – MattR Jun 07 '17 at 20:34
  • 5
    How will this solution work if I the names of the fields to join are different? For example, in three data frames I could have `name1`, `name2` and `name3` respectively. – ps0604 Apr 10 '18 at 10:16
  • 2
    Doesn't this mean that we have `n-1` calls to the merge function? I guess in this case where the number of dataframes is small it doesn't matter, but I wonder if there's a more scalable solution. – eapolinario Jun 21 '18 at 20:09
  • 2
    This didn't quite work for my `df`s with column multi indexes (it was injecting the 'on' as a column which worked for the first merge, but subsequent merges failed), instead I got it to work with: `df = reduce(lambda left, right: left.join(right, how='outer', on='Date'), dfs)` – Adrian Torrie Jan 31 '19 at 07:25
  • 1
    +1 to ps0604. what if the join columns are different, does this work? should we go with pd.merge incase the join columns are different? thanks – steve May 21 '20 at 07:48
  • Thanks! What if `df0,df1` have same columns to merge on and `df0,df2` have same columns to merge on? – haneulkim Feb 09 '21 at 10:06
  • Thanks! This answer preserves the columns, other than the one mentioned in `on`, which is absolutely correct. `concat` wouldn't do it. – Alex S. Feb 21 '22 at 10:21
  • What if they have no same column, but they only have same default column at the beginning of the dataframe? How to deal with this? – curiouscheese Mar 24 '22 at 10:38
173

You could try this if you have 3 dataframes

# Merge multiple dataframes
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])

pd.merge(pd.merge(df1,df2,on='name'),df3,on='name')

alternatively, as mentioned by cwharland

df1.merge(df2,on='name').merge(df3,on='name')
Zero
  • 74,117
  • 18
  • 147
  • 154
  • 49
    For cleaner looks you can chain them `df1.merge(df2,on='name').merge(df3,on='name')` – cwharland May 15 '14 at 12:26
  • 2
    How will this solution work if I the names of the fields to join are different? For example, in three data frames I could have `name1`, `name2` and `name3` respectively – ps0604 Apr 10 '18 at 10:31
  • 8
    @ps0604 `df1.merge(df2,left_on='name1', right_on='name2').merge(df3,left_on='name1', right_on='name3').drop(columns=['name2', 'name3']).rename(columns={'name1':'name'})` – Michael H. Jun 01 '18 at 08:32
  • 1
    and further, how to do this using the index. Doesn't seem to work if 'name' is the index and not a column name. – Brian D Feb 17 '19 at 22:22
158

This is an ideal situation for the join method

The join method is built exactly for these types of situations. You can join any number of DataFrames together with it. The calling DataFrame joins with the index of the collection of passed DataFrames. To work with multiple DataFrames, you must put the joining columns in the index.

The code would look something like this:

filenames = ['fn1', 'fn2', 'fn3', 'fn4',....]
dfs = [pd.read_csv(filename, index_col=index_col) for filename in filenames)]
dfs[0].join(dfs[1:])

With @zero's data, you could do this:

df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])

dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
dfs[0].join(dfs[1:])

     attr11 attr12 attr21 attr22 attr31 attr32
name                                          
a         5      9      5     19     15     49
b         4     61     14     16      4     36
c        24      9      4      9     14      9
Ted Petrou
  • 59,042
  • 19
  • 131
  • 136
  • 15
    Joining all of the dfs to an empty dataframe also works: `pd.DataFrame().join(dfs, how="outer")`. This can be cleaner in some situations. – Dominik Jul 02 '19 at 10:05
  • 8
    This is decent advice and has now been incorporated into [pandas merging 101](https://stackoverflow.com/a/53645883/4909087) (see the section on merging multiple dataframes). It's worth noting that if your join keys are unique, using `pd.concat` will result in simpler syntax: `pd.concat([df.set_index('name') for df in dfs], axis=1, join='inner').reset_index()`. `concat` is also more versatile when dealing with duplicate column names across multiple dfs (`join` isn't as good at this) although you can only perform inner or outer joins with it. – cs95 Jul 28 '19 at 06:26
  • `dfs[0].join(dfs[1:])` should be edited to `dfs[0].join(dfs[1:], sort=False)` because otherwise a `FutureWarning` will pop up. Thanks for the nice example. – gies0r Apr 19 '20 at 22:05
  • 3
    I get an error on trying that: `ValueError: Indexes have overlapping values`, although, by inspection of the individual dataframes in the list, they don't seem to have overlapping values. – SomJura Jun 01 '20 at 10:53
39

In python 3.6.3 with pandas 0.22.0 you can also use concat as long as you set as index the columns you want to use for the joining:

pd.concat(
    objs=(iDF.set_index('name') for iDF in (df1, df2, df3)),
    axis=1, 
    join='inner'
).reset_index()

where df1, df2, and df3 are defined as in John Galt's answer:

import pandas as pd
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12']
)
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22']
)
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32']
)
Dr Fabio Gori
  • 1,105
  • 16
  • 21
  • 6
    This should be the accepted answer. It's the fastest. – R. Zhu May 01 '20 at 22:12
  • What if dataframe shapes are different? – Abhilash Ramteke Sep 06 '21 at 18:55
  • @AbhilashRamteke If you mean that they have different number or rows (so the `name` column is not the same in all data frames) then `join='outer'` should preserve them all, but you will have missing values. No issues with respect to different column sets, as long as they all share the `name` column, which is used for index – Dr Fabio Gori Oct 04 '21 at 06:26
21

This can also be done as follows for a list of dataframes df_list:

df = df_list[0]
for df_ in df_list[1:]:
    df = df.merge(df_, on='join_col_name')

or if the dataframes are in a generator object (e.g. to reduce memory consumption):

df = next(df_list)
for df_ in df_list:
    df = df.merge(df_, on='join_col_name')
Alex
  • 12,078
  • 6
  • 64
  • 74
11

Simple Solution:

If the column names are similar:

 df1.merge(df2,on='col_name').merge(df3,on='col_name')

If the column names are different:

df1.merge(df2,left_on='col_name1', right_on='col_name2').merge(df3,left_on='col_name1', right_on='col_name3').drop(columns=['col_name2', 'col_name3']).rename(columns={'col_name1':'col_name'})
Gil Baggio
  • 13,019
  • 3
  • 48
  • 37
8

One does not need a multiindex to perform join operations. One just need to set correctly the index column on which to perform the join operations (which command df.set_index('Name') for example)

The join operation is by default performed on index. In your case, you just have to specify that the Name column corresponds to your index. Below is an example

A tutorial may be useful.

# Simple example where dataframes index are the name on which to perform
# the join operations
import pandas as pd
import numpy as np
name = ['Sophia' ,'Emma' ,'Isabella' ,'Olivia' ,'Ava' ,'Emily' ,'Abigail' ,'Mia']
df1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=name)
df2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'],         index=name)
df3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'],     index=name)
df = df1.join(df2)
df = df.join(df3)

# If you have a 'Name' column that is not the index of your dataframe,
# one can set this column to be the index
# 1) Create a column 'Name' based on the previous index
df1['Name'] = df1.index
# 1) Select the index from column 'Name'
df1 = df1.set_index('Name')

# If indexes are different, one may have to play with parameter how
gf1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=range(8))
gf2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=range(2,10))
gf3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'], index=range(4,12))

gf = gf1.join(gf2, how='outer')
gf = gf.join(gf3, how='outer')
menuka
  • 61
  • 8
Guillaume Jacquenot
  • 11,217
  • 6
  • 43
  • 49
8

Here is a method to merge a dictionary of data frames while keeping the column names in sync with the dictionary. Also it fills in missing values if needed:

This is the function to merge a dict of data frames

def MergeDfDict(dfDict, onCols, how='outer', naFill=None):
  keys = dfDict.keys()
  for i in range(len(keys)):
    key = keys[i]
    df0 = dfDict[key]
    cols = list(df0.columns)
    valueCols = list(filter(lambda x: x not in (onCols), cols))
    df0 = df0[onCols + valueCols]
    df0.columns = onCols + [(s + '_' + key) for s in valueCols] 

    if (i == 0):
      outDf = df0
    else:
      outDf = pd.merge(outDf, df0, how=how, on=onCols)   

  if (naFill != None):
    outDf = outDf.fillna(naFill)

  return(outDf)

OK, lets generates data and test this:

def GenDf(size):
  df = pd.DataFrame({'categ1':np.random.choice(a=['a', 'b', 'c', 'd', 'e'], size=size, replace=True),
                      'categ2':np.random.choice(a=['A', 'B'], size=size, replace=True), 
                      'col1':np.random.uniform(low=0.0, high=100.0, size=size), 
                      'col2':np.random.uniform(low=0.0, high=100.0, size=size)
                      })
  df = df.sort_values(['categ2', 'categ1', 'col1', 'col2'])
  return(df)


size = 5
dfDict = {'US':GenDf(size), 'IN':GenDf(size), 'GER':GenDf(size)}   
MergeDfDict(dfDict=dfDict, onCols=['categ1', 'categ2'], how='outer', naFill=0)
rz1317
  • 99
  • 1
  • 2
5

There is another solution from the pandas documentation (that I don't see here),

using the .append

>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
   A  B
0  1  2
1  3  4
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
   A  B
0  5  6
1  7  8
>>> df.append(df2, ignore_index=True)
   A  B
0  1  2
1  3  4
2  5  6
3  7  8

The ignore_index=True is used to ignore the index of the appended dataframe, replacing it with the next index available in the source one.

If there are different column names, Nan will be introduced.

Sylhare
  • 5,907
  • 8
  • 64
  • 80
  • it's semantic, for someone using the word "join" to say putting together the two dataframe. (not necessarely as the SQL join operation) – Sylhare May 13 '18 at 01:57
2

I tweaked the accepted answer to perform the operation for multiple dataframes on different suffix parameters using reduce and i guess it can be extended to different on parameters as well.

from functools import reduce 

dfs_with_suffixes = [(df2,suffix2), (df3,suffix3), 
                     (df4,suffix4)]

merge_one = lambda x,y,sfx:pd.merge(x,y,on=['col1','col2'..], suffixes=sfx)

merged = reduce(lambda left,right:merge_one(left,*right), dfs_with_suffixes, df1)
Siddhant Tandon
  • 651
  • 4
  • 15
  • Tweaked approach is great; however, a small fix must be added to avoid *ValueError: too many values to unpack (expected 2)*, a left suffices as empty string `""`. The final merge function could be as follow: `merge_one = lambda x,y,sfx:pd.merge(x,y,on=['col1','col2'..], suffixes=('', sfx)) # Left gets no suffix, right gets something identifiable` – Marukox Jul 06 '21 at 17:27
1
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12']
)
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['d', 14, 16]]

),
    columns=['name', 'attr21', 'attr22']
)
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['c', 4, 36],
    ['d', 14, 9]]),
    columns=['name', 'attr31', 'attr32']
)
df4 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['c', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr41', 'attr42']
)

Three ways to join list dataframe

  1. pandas.concat
dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
# cant not run if index not unique 
dfs = pd.concat(dfs, join='outer', axis = 1) 
  1. functools.reduce
dfs = [df1, df2, df3, df4]
# still run with index not unique 
import functools as ft
df_final = ft.reduce(lambda left, right: pd.merge(left, right, on='name', how = 'outer'), dfs)
  1. join
# cant not run if index not unique 
dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
dfs[0].join(dfs[1:], how = 'outer')
BinhNN
  • 87
  • 7
1

Joining together all three can be done using .join() function.

You have three DataFrames lets say df1, df2, df3. To join these into one DataFrame you can:

df = df1.join(df2).join(df3)

This is the simplest way I found to do this task.

M Bilal Ayaz
  • 119
  • 8