
I have 25 data frames, and I need to merge them and find the rows that recur across all 25. For example, my data frames look like the following:

df1
chr start   end     name
1   12334   12334   AAA
1   2342    2342    SAP
2   3456    3456    SOS
3   4537    4537    ABR
df2
chr start   end     name
1   12334   12334   DSF
1   3421    3421    KSF
2   7689    7689    LUF
df3 
chr start   end     name
1   12334   12334   DSF
1   3421    3421    KSF
2   4537    4537    LUF
3   8976    8976    BAR
4   6789    6789    AIN

And in the end, I am aiming for an output data frame like the following:

chr start   end     name    Sample
1   12334   12334   AAA df1
1   12334   12334   DSF df2
1   12334   12334   DSF df3

I can get there with the following solution, which uses a dictionary to collect the data frames into one place:

dfs = {'df1': df1, 'df2': df2, 'df3': df3}

Then further,

# keys (chr, start, end) present in every frame
common_tups = set.intersection(*[set(df[['chr', 'start', 'end']].drop_duplicates().apply(tuple, axis=1).values) for df in dfs.values()])
# keep only the matching rows, tagging each with its source frame
pd.concat([df[df[['chr', 'start', 'end']].apply(tuple, axis=1).isin(common_tups)].assign(Sample=name) for (name, df) in dfs.items()])
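For reference, the approach above as a self-contained sketch on the three demo frames (the values are copied from the example data in the question):

```python
import pandas as pd

df1 = pd.DataFrame({'chr': [1, 1, 2, 3],
                    'start': [12334, 2342, 3456, 4537],
                    'end': [12334, 2342, 3456, 4537],
                    'name': ['AAA', 'SAP', 'SOS', 'ABR']})
df2 = pd.DataFrame({'chr': [1, 1, 2],
                    'start': [12334, 3421, 7689],
                    'end': [12334, 3421, 7689],
                    'name': ['DSF', 'KSF', 'LUF']})
df3 = pd.DataFrame({'chr': [1, 1, 2, 3, 4],
                    'start': [12334, 3421, 4537, 8976, 6789],
                    'end': [12334, 3421, 4537, 8976, 6789],
                    'name': ['DSF', 'KSF', 'LUF', 'BAR', 'AIN']})

dfs = {'df1': df1, 'df2': df2, 'df3': df3}

# Keys (chr, start, end) present in every frame.
common_tups = set.intersection(
    *[set(df[['chr', 'start', 'end']].drop_duplicates()
          .apply(tuple, axis=1).values) for df in dfs.values()])

# Keep only those rows, tagging each with its source frame.
out = pd.concat(
    [df[df[['chr', 'start', 'end']].apply(tuple, axis=1).isin(common_tups)]
     .assign(Sample=name) for name, df in dfs.items()])
print(out)
```

On this data only the key (1, 12334, 12334) survives, so the result has one row per frame.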

This produces the desired result for the three demo frames, but I actually have 25 data frames, which I list from a directory as follows:

path         = 'Fltered_vcfs/' 
files        = os.listdir(path)
results      = [os.path.join(path,i) for i in files if i.startswith('vcf_filtered')]

So how can I build the dictionary from the list 'results' and proceed to get the desired output? Any help or suggestions are greatly appreciated.

Thank you

ARJ

1 Answer


Using the glob module, you can use

import os
from glob import glob

path = 'Fltered_vcfs' 
f_names = glob(os.path.join(path, 'vcf_filtered*.*')) 

Then, your dictionary can be created with dictionary comprehension using

import pandas as pd

dfs = {os.path.splitext(os.path.split(f_name)[1])[0]: pd.read_csv(f_name, sep='\t')
       for f_name in f_names}
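An end-to-end sketch of that loading step, using a temporary directory with two tiny tab-separated files (the file names and tab separator are assumptions based on the question):

```python
import os
import tempfile
from glob import glob

import pandas as pd

# Create two sample TSV files matching the 'vcf_filtered*' naming pattern.
tmp = tempfile.mkdtemp()
for fname, row in [('vcf_filtered_1.tsv', '1\t12334\t12334\tAAA\n'),
                   ('vcf_filtered_2.tsv', '1\t12334\t12334\tDSF\n')]:
    with open(os.path.join(tmp, fname), 'w') as fh:
        fh.write('chr\tstart\tend\tname\n' + row)

# Build the dictionary: file name (minus extension) -> data frame.
f_names = glob(os.path.join(tmp, 'vcf_filtered*.*'))
dfs = {os.path.splitext(os.path.split(f_name)[1])[0]: pd.read_csv(f_name, sep='\t')
       for f_name in f_names}
print(sorted(dfs))
```

The resulting dictionary can be passed straight into the intersection/concat code from the question, with the Sample column taking the file names as values.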
Ami Tavory
  • There is a problem: my actual data frames have more columns, say 18 each. – ARJ Sep 26 '16 at 12:13
  • @user1017373 Not sure what the problem is. If you'd like to load only a specific subset of the columns, you can use [`usecols=['start', 'end', 'name']`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html). Is that your problem? – Ami Tavory Sep 26 '16 at 12:18
  • No, I am loading all the columns and then merging on matching rows from the three key columns ['chr', 'start', 'end']. The question is the same, but the input differs slightly, and the above solution throws an index error even after stripping whitespace from all data frames: KeyError: "['chr' 'start' 'end'] not in index" – ARJ Sep 26 '16 at 13:13
  • @user1017373 But then it looks like they're just not in the index, no? So it doesn't look like your input matches the specifications of your algorithm. – Ami Tavory Sep 26 '16 at 13:15
  • Actually, the column names differ from what I specified in the demo, and I changed them when working on the real data frames. The real dfs have the columns I need as chrom, start, and end. – ARJ Sep 26 '16 at 13:23
  • Yes, now I understand the mistake: I had to specify the separator, as {os.path.splitext(os.path.split(f_name)[1])[0]: pd.read_csv(f_name, sep='\t') for f_name in f_names}, and now it's working. Thank you so much! – ARJ Sep 26 '16 at 14:53
  • Is it possible, with the same algorithm, to print out the rows repeated in more than one data frame? – ARJ Sep 26 '16 at 15:33
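For the follow-up in the last comment, one option is to count how many frames each (chr, start, end) key appears in and keep keys seen in at least two of them, rather than intersecting all frames. A minimal sketch (the demo frames and the >= 2 threshold are assumptions, not from the original thread):

```python
from collections import Counter

import pandas as pd

df1 = pd.DataFrame([[1, 12334, 12334, 'AAA'], [3, 4537, 4537, 'ABR']],
                   columns=['chr', 'start', 'end', 'name'])
df2 = pd.DataFrame([[1, 12334, 12334, 'DSF'], [2, 7689, 7689, 'LUF']],
                   columns=['chr', 'start', 'end', 'name'])
df3 = pd.DataFrame([[2, 7689, 7689, 'LUF'], [4, 6789, 6789, 'AIN']],
                   columns=['chr', 'start', 'end', 'name'])
dfs = {'df1': df1, 'df2': df2, 'df3': df3}

# Count each key once per frame (drop_duplicates guards against repeats
# within a single frame).
counts = Counter(tup for df in dfs.values()
                 for tup in df[['chr', 'start', 'end']]
                 .drop_duplicates().apply(tuple, axis=1))

# Keep keys that occur in two or more frames.
shared = {tup for tup, n in counts.items() if n >= 2}

out = pd.concat(
    [df[df[['chr', 'start', 'end']].apply(tuple, axis=1).isin(shared)]
     .assign(Sample=name) for name, df in dfs.items()])
print(out)
```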