Merge dataframes via columnwise in a directory with common columns

Question

I'm trying to merge multiple .txt files in a directory specifically merging on one of the common column X found in dataframes.

import pandas as pd

  df1 = pd.DataFrame({'X': ['X0', 'X1', 'X2', 'X3'],
       ...:                     'B': ['B0', 'B1', 'B2', 'B3'],
       ...:                     'C': ['C0', 'C1', 'C2', 'C3'],
       ...:                     'D': ['D0', 'D1', 'D2', 'D3']})


 df2 = pd.DataFrame({'X': ['X0', 'X1', 'X2', 'X3'],
   ...:                     'G': ['G0', 'G1', 'G2', 'G3'],
   ...:                     'H': ['H0', 'H1', 'H2', 'H3'],
   ...:                     'J': ['J0', 'J1', 'J2', 'J3']})

by following previous this post Pandas Merging 101 and this https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

I tried

data = pd.DataFrame()
for file in glob.iglob(filepath + '/*.txt', recursive=True):

    print(file)

    df=pd.read_csv(file, sep ='\t',header=0, skiprows=0, skipfooter=0, na_values=(""," ","NA"))
    data=pd.concat([data,df])

data_merge = pd.merge(data,on='X',how='inner')

but I got

TypeError: merge() missing 1 required positional argument: 'right'

How can I join two datafame by common X column?

the expected output

Thanks.

you can use `pd.merge` either with: `df1.merge(df2, on='X')`, or `pd.merge(df1, df2, on='X')`, they're the same thing — Dan, Feb 06 '20 at 15:58

Quang Hoang · Accepted Answer · 2020-02-06T16:04:43.317

3

concat is enough, you don't need merge. Do data = [], read the files in to dataframes, append them to data and concat once:

data = []
for file in glob.iglob(filepath + '/*.txt', recursive=True):

    print(file)

    df=pd.read_csv(file, sep ='\t',header=0, skiprows=0, skipfooter=0, na_values=(""," ","NA"))
    data.append(df.set_index('X'))

data_merge = pd.concat(data, axis=1)

This assumes that your X columns do not contain duplicates in all the dataframes. Otherwise, you need to do iterative merge inside the for loop without concat, somthing like:

final_df = None
for file in glob.iglob(filepath + '/*.txt', recursive=True):

    print(file)

    df=pd.read_csv(file, sep ='\t',header=0, skiprows=0, skipfooter=0, na_values=(""," ","NA"))
    if final_df is not None:
        final_df = final_df.merge(df, on='X', how='outer')
    else: 
        final_df = df

edited Feb 06 '20 at 16:04

answered Feb 06 '20 at 15:59

Quang Hoang

146,074
10
56
74

What if I have multiple common columns like ['Z', 'Q', 'W'] – Alexander Feb 06 '20 at 16:00
and `merge` is more way flexible than `concat` in terms of join options. – Alexander Feb 06 '20 at 16:01
1

`set_index(['Z','Q','W'])` instead of `set_index('X')`. You can also pass `index_cols=['Z','Q','W']` to `read_csv` and forget about `set_index`. – Quang Hoang Feb 06 '20 at 16:02
what does `if final_df is not None:` do here ? checkign the data file is emtpy or not ? – Alexander Feb 06 '20 at 16:15
It checks if this is the first file or not. You can also do `final_df = pd.DataFrame(columns='X')` and get rid of that `if` check. – Quang Hoang Feb 06 '20 at 16:17
Thanks! This is great! – Alexander Feb 06 '20 at 16:24

sawezo · Answer 2 · 2020-02-06T16:16:31.233

1

You aren't including the right dataframe to merge on, as in you are only giving the merge call a single dataframe when it needs both. Try: pd.merge(df1, df2, on="X")

Edit:

dataframe_list = []
for file in glob.iglob(filepath + '/*.txt', recursive=True):

    print(file)

    data=pd.read_csv(file, sep ='\t',header=0, skiprows=0, skipfooter=0, na_values=(""," ","NA"))
    dataframe_list.append(data)

data_merge = pd.merge(dataframe_list[0], dataframe_list[1], on='X',how='inner')

Edit 2, more than two dataframes:

dataframe_list = []
for file in glob.iglob(filepath + '/*.txt', recursive=True):

    print(file)

    data=pd.read_csv(file, sep ='\t',header=0, skiprows=0, skipfooter=0, na_values=(""," ","NA"))
    dataframe_list.append(data)

    if len(dataframe_list) ==2:
        data_merge = pd.merge(dataframe_list[0], dataframe_list[1], on='X')
        dataframe_list = [data_merge]
dataframe = dataframe_list[0]

edited Feb 06 '20 at 16:16

answered Feb 06 '20 at 15:57

sawezo

305
1
8

Cannot do that. Everything in a for loop. Please check OP carefully. – Alexander Feb 06 '20 at 15:58
In that case I would make a list of dataframes, fill it with data from each file, then merge the contents of the list, like this: – sawezo Feb 06 '20 at 15:59
1

can you send a solution for that? – Alexander Feb 06 '20 at 16:01
what if I have 100 files in the folder ? we need to do dataframe_list[0], dataframe_list[1], dataframe_list[2], dataframe_list[3]....dataframe_list[99] ? – Alexander Feb 06 '20 at 16:06
I would just merge contents when the list is two items large, then reset the list. added in edit – sawezo Feb 06 '20 at 16:17
1

Thanks for your solution. It really helped! – Alexander Feb 06 '20 at 16:25

Merge dataframes via columnwise in a directory with common columns

2 Answers2