
I want to find the common rows between 2 dataframes. To find the common rows, I can use inner_join(), semi_join(), and merge(). I have gone through different posts, including this one, but these operations are not fulfilling my purpose, because the data in my dataframes is a little different!

Sometimes the data in the dataframes can be reversed, like in the 3rd and 5th rows of dataframe-1 and dataframe-2: dataframe-1 contains A3 A1 0.75 but dataframe-2 contains A1 A3. I would like to treat these two rows as the same.

My first dataframe looks like this:

  query target weight
1    A1     A2   0.60
2    A2     A5   0.50
3    A3     A1   0.75
4    A4     A5   0.88
5    A5     A3   0.99
6    (+)-1(10),4-Cadinadiene     Falcarinone-10     0.09
7    Leucodelphinidin-100    (+)-1(10),4-Cadinadiene     0.876
8    Lignin  (2E,7R,11R)-2-Phyten-1-ol   0.778
9    (2E,7R,11R)-2-Phyten-1-ol   Leucodelphinidin    0.55
10   Falcarinone     Lignin  1
11   A1  (+)-1(10),4-Cadinadiene     1
12   A2  Lignin-10  1
13   A3  (2E,7R,11R)-2-Phyten-1-ol   1
14   Falcarinone  A6    1
15   A4  Leucodelphinidin    1
16   A4  Leucodelphinidin    1
17   Falcarinone  A100    1
18   A4  Falcarinone     1

The second dataframe looks like this:

  query target
1    A1     A2   
2    A2     A5   
3    A1     A3  // Missing in the output
4    A4     A5   
5    A3     A5  // Missing in the output
6    A3  (2E,7R,11R)-2-Phyten-1-ol   
7    (+)-1(10),4-Cadinadiene     Falcarinone    
8    Leucodelphinidin    (+)-1(10),4-Cadinadiene-100    
9    Lignin-2  (2E,7R,11R)-2-Phyten-1-ol   
10   A11  (+)-1(10),4-Cadinadiene    
11   A2  Lignin  
12   A3  (2E,7R,11R)-2-Phyten-1-0l 
13   Falcarinone  A60    
14   A4  Leucodelphinidin  // Missing in the output

The code I am using is `output <- semi_join(df_1, df_2)` or `output <- inner_join(df_1, df_2)`.
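A minimal sketch of that call, using the reproducible data given further below (this assumes dplyr is loaded and the join is done on the shared query/target columns):

library(dplyr)
output <- semi_join(df_1, df_2, by = c("query", "target"))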

The output I am getting is:

  query target weight
1    A1     A2   0.60
2    A2     A5   0.50

But my expected output is like this:

  query target weight
1    A1     A2   0.60
2    A2     A5   0.50
3    A3     A1   0.75
4    A4     A5   0.88
5    A5     A3   0.99
6    A4  Leucodelphinidin  1

Reproducible code is given below:

df_1 <- read.table(text="query   target     weight
A1  A2  0.6
A2  A5  0.5
A3  A1  0.75
A4  A5  0.88
A5  A3  0.99
(+)-1(10),4-Cadinadiene     Falcarinone     0.09
Leucodelphinidin    (+)-1(10),4-Cadinadiene     0.876
Lignin  (2E,7R,11R)-2-Phyten-1-ol   0.778
(2E,7R,11R)-2-Phyten-1-ol   Leucodelphinidin    0.55
Falcarinone     Lignin  1
A1  (+)-1(10),4-Cadinadiene     1
A2  Lignin  1
A3  (2E,7R,11R)-2-Phyten-1-ol   1
Falcarinone  A6    1
A4  Leucodelphinidin    1
A4  Leucodelphinidin    1
Falcarinone  A100    1
A5  Falcarinone     1", header=TRUE)
df_2 <- read.table(text="query   target
A1  A2 
A2  A5
A1  A3  
A4  A5  
A3  A5  
(+)-1(10),4-Cadinadiene     Falcarinone    
Leucodelphinidin    (+)-1(10),4-Cadinadiene-100    
Lignin-2  (2E,7R,11R)-2-Phyten-1-ol   
A11  (+)-1(10),4-Cadinadiene    
A2  Lignin  
A3  (2E,7R,11R)-2-Phyten-1-0l 
Falcarinone  A6    
A4  Leucodelphinidin  ", header=TRUE)

Any kind of suggestion is appreciated.

0Knowledge
  • Isn't your expected output the same as your `df_1`? – MarBlo Jan 03 '21 at 07:01
  • @MarBlo Yes, thanks for the comment. It seems I caused confusion by using the same kind of dataframe. However, I have now made changes to my dataframe. Please look again and give your suggestions. – 0Knowledge Jan 03 '21 at 08:42

2 Answers


Maybe you can try:

output <- merge(df_1, df_2, all=T)

and then check for duplicated rows regardless of ordering, with something like:

same.rows <- duplicated(t(apply(output, 1, sort)))

which returns a vector of flags:

FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE

You can then keep the rows that are FALSE:

output[which(same.rows==F),]


query target weight
1    A1     A2   0.60
2    A1     A3   0.75
3    A2     A5   0.50
5    A3     A5   0.99
6    A4     A5   0.88

Does it make sense?
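To illustrate what the sorted-row check does, here is a tiny toy example (made up for clarity, not part of the original answer; it assumes both rows carry the same three columns): after sorting the values within each row, a reversed pair becomes identical to the original pair, so duplicated() can flag it.

# Toy data: the same pair written in opposite order
m <- data.frame(query  = c("A3", "A1"),
                target = c("A1", "A3"),
                weight = c(0.75, 0.75))

# Sort the values within each row; both rows become "0.75" "A1" "A3"
t(apply(m, 1, sort))

# The second row is now flagged as a duplicate of the first
duplicated(t(apply(m, 1, sort)))
# [1] FALSE  TRUE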

efz
  • Thank you very much, but your code is not working for me. This is actually my fault, because my actual dataframe contains different kinds of data and I only gave a small portion of (more specifically, similar kinds of) data in my previous post. However, please look at it now and give your suggestion. – 0Knowledge Jan 03 '21 at 08:46
  • Hi, the code sample still seems incomplete. The solution I gave still works fine and I get the set of flags ````FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE````, that is, it identifies two rows as being similar: (A1, A3, 0.75) and (A3, A5, 0.99). Maybe provide a longer data sample? – efz Jan 03 '21 at 09:03
  • Please check now; I have updated the dataframe. Let me know. – 0Knowledge Jan 03 '21 at 09:26
  • After applying your code to the dataframe given in the question, I am getting only 1 row of output: `(+)-1(10),4-Cadinadiene Falcarinone 0.09` – 0Knowledge Jan 03 '21 at 09:30
  • Does it mean that df_1 contains `3 columns` and df_2 contains `2 columns`, and that's why the code is not working perfectly? – 0Knowledge Jan 03 '21 at 09:51
  • Yeah, well, it tries to match all columns, so df_2 should also contain three columns. Try ````same.rows <- duplicated(output, fromLast = T)````. Does it do what you'd expect? – efz Jan 03 '21 at 09:59

You could write a small function that sorts the first two columns within each row of both data frames, then merge them.

# sc(): sort the columns selected by `i` within each row, keeping the remaining columns and the original column names
sc <- function(x, i) setNames(cbind(data.frame(t(apply(x[i], 1, sort))), x[-i]), names(x))

res <- merge(sc(df_1, 1:2), sc(df_2, 1:2))
res[!duplicated(res), ]  ## remove duplicates
#                      query           target weight
# 1  (+)-1(10),4-Cadinadiene      Falcarinone   0.09
# 2                       A1               A2   0.60
# 3                       A1               A3   0.75
# 4                       A2               A5   0.50
# 5                       A2           Lignin   1.00
# 6                       A3               A5   0.99
# 7                       A4               A5   0.88
# 8                       A4 Leucodelphinidin   1.00
# 10                      A6      Falcarinone   1.00
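For intuition, here is what sc() does to a single reversed row (a small illustration added for clarity, not part of the original answer): the (query, target) pair is put into sorted order while the weight stays attached to the row.

sc(data.frame(query = "A3", target = "A1", weight = 0.75), 1:2)
#   query target weight
# 1    A1     A3   0.75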

Edit

A solution with data.table, which should be more memory efficient:

library(data.table)
# put each (query, target) pair into a fixed (alphabetical) order
setDT(df_1)[,c("query", "target") := list(pmin(query,target), pmax(query,target))]
setDT(df_2)[,c("query", "target") := list(pmin(query,target), pmax(query,target))]
res <- merge(df_1[!duplicated(df_1),], df_2, allow.cartesian=TRUE)
res
#                      query           target weight
# 1: (+)-1(10),4-Cadinadiene      Falcarinone   0.09
# 2:                      A1               A2   0.60
# 3:                      A1               A3   0.75
# 4:                      A2               A5   0.50
# 5:                      A2           Lignin   1.00
# 6:                      A3               A5   0.99
# 7:                      A4               A5   0.88
# 8:                      A4 Leucodelphinidin   1.00
# 9:                      A6      Falcarinone   1.00

To get back "data.frame"s, just do e.g. setDF(res).
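As a side note on the pmin()/pmax() step: both functions also work element-wise on character vectors, so each (query, target) pair ends up in alphabetical order regardless of which column a name originally sat in. A small illustration (added for clarity, not part of the original answer):

# The reversed pair from the question, normalised the same way in both frames
pmin("A3", "A1"); pmax("A3", "A1")   # df_1 row 3: "A1" "A3"
pmin("A1", "A3"); pmax("A1", "A3")   # df_2 row 3: "A1" "A3"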

jay.sf
  • Thank you very much. Your code is working perfectly for the demo data given in the post, but when I use it on my real data it shows `Error: cannot allocate vector of size 49.8 Gb`. However, the size of df_1 is `1272464 bytes, 65893 objects, and 3 variables` and df_2 is `3507976 bytes, 202732 objects, and 2 variables`. I am not sure how these 2 small data frames are making `94.8 Gb`. Could you give me any suggestions, please? – 0Knowledge Jan 03 '21 at 14:31
  • Object operations usually make copies of the objects, i.e. when manipulating object A, say row-wise, at first a new empty object B is created which then gets filled row by row. So the memory size has already doubled. Now count the operations made in the function `sc()` and you get an idea of what's happening. You could try to use `sc()` first on the two data frames, remove `df_1/2` from your workspace, and then put them into `merge`. Alternatively you could look for a [`data.table`](https://cran.r-project.org/web/packages/data.table/index.html) solution, which is known for not making copies. – jay.sf Jan 03 '21 at 15:08
  • I tried to use the `sc()` function on df_1 and df_2, and the function worked perfectly. Then I removed all `objects` from the workspace. Finally, I applied the `merge` function and got the same error. Bad luck! – 0Knowledge Jan 03 '21 at 15:40
  • @Akib62 I came up with a data.table solution, see Edit. I'm curious, does it work for you? – jay.sf Jan 03 '21 at 15:48
  • `Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 2880332 rows; more than 255736 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.` – 0Knowledge Jan 03 '21 at 16:00
  • Try `merge(df_1[!duplicated(df_1),], df_2, allow.cartesian=TRUE)`, Ref.:https://stackoverflow.com/q/36349613/6574038 – jay.sf Jan 03 '21 at 16:54