
I have multiple data frames (25 dataframes), and I am looking for recurrently occurring row values from three columns across all of the dataframes. The following is an example of my dataframes:

df1
chr start   end     name
1   12334   12334   AAA
1   2342    2342    SAP
2   3456    3456    SOS
3   4537    4537    ABR
df2
chr start   end     name
1   12334   12334   DSF
1   3421    3421    KSF
2   7689    7689    LUF
df3 
chr start   end     name
1   12334   12334   DSF
1   3421    3421    KSF
2   4537    4537    LUF
3   8976    8976    BAR
4   6789    6789    AIN

And finally, what I am aiming for is to look into the first three columns of these dataframes and extract a new dataframe based on rows that match on these three column values, with the name of the source dataframe as the last column. So the final dataframe should look like this:

chr start   end     name    Sample
1   12334   12334   AAA     df1
1   12334   12334   DSF     df2
1   12334   12334   DSF     df3

I know the following lines of Python will create the above output, except for the Sample column:

s1 = pd.merge(df1, df2, how='left', on=['chr', 'start', 'end'])
df_final = pd.merge(s1, df3[['chr', 'start', 'end']], how='left', on=['chr', 'start','end'])

but I have more than 25 dataframes which I need to merge based on matching values. Any robust and better solution would be really appreciated.

benten
ARJ

1 Answer


Say you have a dictionary mapping sample names to DataFrames:

dfs = {'df1': df1, 'df2': df2}

(and so on).

The common relevant keys (in hashable form) are:

common_tups = set.intersection(
    *[set(df[['chr', 'start', 'end']].drop_duplicates()
          .apply(tuple, axis=1).values)
      for df in dfs.values()])
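To make this concrete, here is a minimal runnable sketch using small frames built from the question's example data (names abbreviated); only the key shared by every frame should survive the intersection:

```python
import pandas as pd

# Toy versions of the question's dataframes.
df1 = pd.DataFrame({'chr': [1, 1, 2, 3],
                    'start': [12334, 2342, 3456, 4537],
                    'end': [12334, 2342, 3456, 4537],
                    'name': ['AAA', 'SAP', 'SOS', 'ABR']})
df2 = pd.DataFrame({'chr': [1, 1, 2],
                    'start': [12334, 3421, 7689],
                    'end': [12334, 3421, 7689],
                    'name': ['DSF', 'KSF', 'LUF']})
df3 = pd.DataFrame({'chr': [1, 1, 2, 3, 4],
                    'start': [12334, 3421, 4537, 8976, 6789],
                    'end': [12334, 3421, 4537, 8976, 6789],
                    'name': ['DSF', 'KSF', 'LUF', 'BAR', 'AIN']})
dfs = {'df1': df1, 'df2': df2, 'df3': df3}

# Each frame's key columns become a set of tuples; intersect them all.
common_tups = set.intersection(
    *[set(df[['chr', 'start', 'end']].drop_duplicates()
          .apply(tuple, axis=1).values)
      for df in dfs.values()])
# Only (1, 12334, 12334) appears in all three frames.
```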

Now you just need, for each DataFrame, to find the relevant rows, add the name of the DataFrame as the sample, and concatenate the results:

pd.concat(
    [df[df[['chr', 'start', 'end']].apply(tuple, axis=1).isin(common_tups)]
       .assign(Sample=name)
     for (name, df) in dfs.items()])
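Putting both steps together, here is a self-contained sketch on two toy frames showing the final result: one row per frame for the shared key, each keeping its own `name` and gaining a `Sample` column:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 12334, 12334, 'AAA'], [1, 2342, 2342, 'SAP']],
                   columns=['chr', 'start', 'end', 'name'])
df2 = pd.DataFrame([[1, 12334, 12334, 'DSF'], [2, 7689, 7689, 'LUF']],
                   columns=['chr', 'start', 'end', 'name'])
dfs = {'df1': df1, 'df2': df2}

# Keys present in every frame.
common_tups = set.intersection(
    *[set(df[['chr', 'start', 'end']].drop_duplicates()
          .apply(tuple, axis=1).values)
      for df in dfs.values()])

# Filter each frame to the common keys, tag with its sample name, and stack.
result = pd.concat(
    [df[df[['chr', 'start', 'end']].apply(tuple, axis=1).isin(common_tups)]
       .assign(Sample=name)
     for (name, df) in dfs.items()])
# result holds (1, 12334, 12334) from both frames, tagged df1 and df2.
```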
Ami Tavory
  • Thanks for the solution, but I have the following error: KeyError: "['end'] not in index"; the end column is not an index – ARJ Sep 23 '16 at 14:58
  • @user1017373 Which line causes it? – Ami Tavory Sep 23 '16 at 14:59
  • The set intersection, common_tups = set.intersection(*[set(df[['chr', 'start', 'end']].drop_duplicates().apply(tuple, axis=1).values) for df in dfs.values()]) – ARJ Sep 23 '16 at 15:00
  • @user1017373 OK. Could you please check that each DataFrame actually has an `end` column? I strongly suspect at least one doesn't. – Ami Tavory Sep 23 '16 at 15:01
  • @user1017373 Specifically, can you print out `[name for (name, df) in dfs.items() if 'end' not in df.columns]`? – Ami Tavory Sep 23 '16 at 15:04
  • I have end for all three test dataframes.. here is what it prints out with the above script, ['df1', 'df2', 'df3'] – ARJ Sep 23 '16 at 15:06
  • @user1017373 The line I asked you to print actually printed the DataFrames *not* containing `end` in the columns. You have something that looks like `'end'`; maybe something like `'end '` (note the spaces). – Ami Tavory Sep 23 '16 at 15:09
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/124058/discussion-between-user1017373-and-ami-tavory). – ARJ Sep 23 '16 at 16:14
  • What if the list of data frame is called by using os.path from the directory how then can the dictionary works ? – ARJ Sep 26 '16 at 11:14
  • @user1017373 That's not very difficult, but it's a separate question. It's difficult to have a technical discussion in the comments section. Could you ask, in a new question, how to create a dictionary of DataFrames through glob? I'm sure people will help you out. If you want, place here a link, and I'll be happy to look at it. – Ami Tavory Sep 26 '16 at 11:21
  • Yes, I posted it as new question ..http://stackoverflow.com/questions/39701779/mapping-multiple-dataframe-based-on-the-matching-columns – ARJ Sep 26 '16 at 11:36
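For the directory question raised in the comments, one possible sketch (hypothetical: it assumes the 25 tables are tab-separated files named like `df1.txt`, `df2.txt`, ... in a `data/` directory) builds the `dfs` dictionary via glob, keyed by file name without extension:

```python
import glob
import os
import pandas as pd

# Hypothetical layout: data/df1.txt, data/df2.txt, ... (tab-separated).
paths = glob.glob(os.path.join('data', '*.txt'))

# Map each file's base name (e.g. 'df1') to its DataFrame.
dfs = {os.path.splitext(os.path.basename(p))[0]: pd.read_csv(p, sep='\t')
       for p in paths}
```

The resulting `dfs` dictionary plugs directly into the `set.intersection` / `pd.concat` solution above.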