Merging dataframes by file name

Question

I have multiple files with the following naming convention.

ENCSR000EQO_0_0.txt
ENCSR000DIA_0_0.txt
ENCSR000DIA_1_1.txt
ENCSR000DIA_2_1.txt
ENCSR000DIM_0_0.txt
ENCSR000DIM_1_1.txt
ENCSR000AIB_0_0.txt
ENCSR000AIB_1_1.txt
ENCSR000AIB_2_1.txt
ENCSR000AIB_3_1.txt

I want to merge them as dataframes using pandas according to the file name, so I would have 4 resulting dataframes. And then for each of these 4, I want to groupby the gene(GeneName) column. Since the same gene will appear multiple times.

They all have the same columns in the same order. I can merge all 10 together at once, but I couldn't figure it out how to merge by name.

path = '/renamed/'

print os.listdir(path)

df_merge = None
for fname in os.listdir(path):
    if fname.endswith('.txt'):
        df = pd.read_csv(path + fname, sep='\t', header=0)
        df.columns = ['ID ', 'Chr', 'Start', 'End', 'Strand', 'Peak Score', 'Focus Ratio/Region Size', 
                     'Ann', 'DetAnn', 'Distance', 'PromoterID', 'EID', 
                     'Unigene', 'Refseq', 'Ensembl', 'GeneName', 'GeneAlias', 
                     'GeneDescription', 'GeneType']
        df = df.groupby('GeneName').agg(np.mean)
        print df

Thank you for any input.

I think you want something called a `join`. Look up `pd.merge`. — Demetri Pananos, Dec 20 '16 at 20:18
Maybe [this post](http://stackoverflow.com/questions/23668427/pandas-joining-multiple-dataframes-on-columns) will be helpful. — lmo, Dec 20 '16 at 20:46

Alex · Accepted Answer · 2016-12-20T20:57:06.193

I would do something more like this, where you can use glob to get the filenames, check each one, and then group the concatenated results.

import glob
path = 'renamed'

df_merge = None
for fid in ('EQO', 'DIA', 'DIM', 'AIB'):
    df_ = pd.DataFrame()
    for fname in glob.glob(os.path.join(path, '*.txt')):
        if fid in fname:
            df = pd.read_csv(fname, sep='\t', header=0)
            df.columns = ['ID ', 'Chr', 'Start', 'End', 'Strand', 'Peak Score', 'Focus Ratio/Region Size', 
                         'Ann', 'DetAnn', 'Distance', 'PromoterID', 'EID', 
                         'Unigene', 'Refseq', 'Ensembl', 'GeneName', 'GeneAlias', 
                         'GeneDescription', 'GeneType']
            df_ = pd.concat((df_, df))
    df_ = df_.groupby('GeneName').agg(np.mean)
    print df_

Edit: expanding answer to be more automated.

Based on your filenames you might be able to id them as follows:

import numpy as np
files = glob.glob(os.path.join(path, '*.txt'))
fids = np.unique([file.split('_')[0] for file in files])

Putting it all together the updated code would be this:

import glob
import numpy as np
path = 'renamed'

files = glob.glob(os.path.join(path, '*.txt'))
fids = np.unique([file.split('_')[0] for file in files])

df_merge = None
for fid in fids:
    df_ = pd.DataFrame()
    for fname in files:
        if fid in fname:
            df = pd.read_csv(fname, sep='\t', header=0)
            df.columns = ['ID ', 'Chr', 'Start', 'End', 'Strand', 'Peak Score', 'Focus Ratio/Region Size', 
                         'Ann', 'DetAnn', 'Distance', 'PromoterID', 'EID', 
                         'Unigene', 'Refseq', 'Ensembl', 'GeneName', 'GeneAlias', 
                         'GeneDescription', 'GeneType']
            df_ = pd.concat((df_, df))
    df_ = df_.groupby('GeneName').agg(np.mean)
    print df_

I'm not familiar with glob. It seems like an interesting possibility, but in this case I actually have 100000 files and a lot of file name variation. — Carol M, Dec 20 '16 at 20:34
Glob will still be helpful and it's part of the standard library. Please see my expanded answer. — Alex, Dec 20 '16 at 20:52
for performance, it's often beneficial to store all of the DFs in a list and postpone calling `concat` until the end. — Paul H, Dec 20 '16 at 21:06
Indeed It worked really well. Specially performance wise. It's already done! I'm amazed. Thank you. — Carol M, Dec 20 '16 at 21:20

Ezer K · Answer 2 · 2016-12-20T20:25:04.740

try adding the file name as column, append all df's to a list and concat them, then group:

df_merge = []
for fname in os.listdir(path):
if fname.endswith('.txt'):
    df = pd.read_csv(path + fname, sep='\t', header=0)
    df.columns = ['ID ', 'Chr', 'Start', 'End', 'Strand', 'Peak     Score', 'Focus Ratio/Region Size', 
                 'Ann', 'DetAnn', 'Distance', 'PromoterID', 'EID', 
                 'Unigene', 'Refseq', 'Ensembl', 'GeneName', 'GeneAlias', 
                 'GeneDescription', 'GeneType']
    df['fname'] = [fname.split('_')[0] for x in df.index] #just to multiple by length 
    df_merge.append(df)

df_all = pd.concat(df_merge)

for fn in set(df_all['fname'].values):    
    print df_all[df_all['fname']==fn].groupby('GeneName').agg(np.mean)

This is a brilliant solution. Why did I never thought of that? hehe But it takes a lot of my computer. — Carol M, Dec 20 '16 at 21:16

Merging dataframes by file name

2 Answers2