0

I have 2 dataframe. Now, I want to get the common and uncommon rows of this two dataframe.

Here, common rows can be vice versa. This means I am searching for common between the two datasets that can have changed in the order of matching columns. For example, row number 2 of the df_1 is ID_2 ID_3 and df_2 is ID_3 ID_2. They are not common in terms of column value but they are common if we do not consider the column value.

After getting the common and uncommon I want to add another column label as 1 for common and 0 for uncommon.

The code I am using

result <- df_2 %>% 
  dplyr::inner_join(df_1, by = c("nodeA" = "query", "nodeB" = "target")) %>% 
  dplyr::mutate(GROUP = 1) %>% 
  dplyr::union(df_2 %>% 
                 dplyr::inner_join(df_1, by = c("nodeB" = "query", "nodeA" = "target")) %>% 
                 dplyr::mutate(GROUP = 2)) %>%
  mutate(label=1)

The output I am getting

  nodeA nodeB new_ssp  ssp GROUP label
1  ID_1  ID_2    0.50 0.50     1     1
2  ID_3  ID_4    0.80 0.80     1     1
3  ID_3  ID_2    0.90 0.90     2     1
4  ID_1  ID_6    0.09 0.09     2     1

Expected output (here, ? means I don't know. It can be 1 or 2 does not matter)

  nodeA nodeB new_ssp  ssp GROUP label
1  ID_1  ID_2    0.50 0.50     1     1
2  ID_3  ID_4    0.80 0.80     1     1
3  ID_3  ID_2    0.90 0.90     2     1
4  ID_1  ID_6    0.09 0.09     2     1
5  ID_4  ID_5    0.70 0.70     ?     0
6  ID_40  ID_50    0.70 0.70     ?     0
7  ID_5  ID_3    0.50 0.50     ?     0
8  ID_15  ID_31    0.50 0.50     ?     0

Reproducible Data

query <- c("ID_1", "ID_2", "ID_3", "ID_4", "ID_5", "ID_6")
target <- c("ID_2", "ID_3", "ID_4", "ID_5", "ID_3", "ID_1")
ssp <- c(0.5, 0.9, 0.8, 0.7, 0.5, 0.09)
df_1 <- data.frame(query, target, ssp)


nodeA <- c("ID_1", "ID_3", "ID_3", "ID_40", "ID_15", "ID_1")
nodeB <- c("ID_2", "ID_2", "ID_4", "ID_50", "ID_31", "ID_6")
new_ssp <- c(0.5, 0.9, 0.8, 0.7, 0.5, 0.09)
df_2 <- data.frame(nodeA, nodeB, new_ssp)
0Knowledge
  • 747
  • 3
  • 14

2 Answers2

0

I'm not 100% sure I understood your question but it seems that you want a composite identifier rather than 2 separate identifiers.

A way to get this would be to create a unique identifier based on the two others, by pasting them after alphabetically sorting.

This is explained in https://stackoverflow.com/a/63698432/3888000.

Then, you just have to use a full_join() and create your label based on the missing values (which you can coalesce if you want).

Here is some code:

library(tidyverse)

df_1_bis = df_1 %>% 
  mutate(col1 = pmin(query, target), col2 = pmax(query, target),
  id = paste(col1, col2, sep="__")) %>% 
  select(-col1, -col2)
df_2_bis = df_2 %>% 
  mutate(col1 = pmin(nodeA, nodeB), col2 = pmax(nodeA, nodeB), 
         id = paste(col1, col2, sep="__")) %>% 
  select(-col1, -col2)


df_2_bis %>% 
  dplyr::full_join(df_1_bis, by = "id") %>% 
  select(id, ssp, new_ssp) %>% 
  mutate(label=ifelse(is.na(ssp) | is.na(new_ssp), "uncommon", "common")) %>% 
  separate(id, into=c("nodeA", "nodeB"), sep="__")
#>   nodeA nodeB  ssp new_ssp    label
#> 1  ID_1  ID_2 0.50    0.50   common
#> 2  ID_2  ID_3 0.90    0.90   common
#> 3  ID_3  ID_4 0.80    0.80   common
#> 4  ID_4  ID_5   NA    0.70 uncommon
#> 5  ID_3  ID_5   NA    0.50 uncommon
#> 6  ID_1  ID_6 0.09    0.09   common
#> 7 ID_40 ID_50 0.70      NA uncommon
#> 8 ID_15 ID_31 0.50      NA uncommon

Created on 2021-03-31 by the reprex package (v1.0.0)

Dan Chaltiel
  • 7,811
  • 5
  • 47
  • 92
  • Thank you for your answer. But, I need the output in the different cells. Like `nodeA nodeB` not in a single cell. Because I have to do something else in later! – 0Knowledge Mar 31 '21 at 16:03
  • @0Knowledge well, you can adjust the `select()` call wherever you need – Dan Chaltiel Mar 31 '21 at 16:05
  • I didn't get you. Would you mind explaining or updating the answer? – 0Knowledge Mar 31 '21 at 16:09
  • @0Knowledge sorry, using select was a bit tedious, but you can `separate()` the id column. I updated my answer. However, as in your example, the information of whether it was originally from nodeA or nodeB is lost. – Dan Chaltiel Mar 31 '21 at 16:14
  • I am getting this error `Error in separate(., id, into = c("nodeA", "nodeB"), sep = "__") : could not find function "separate"` – 0Knowledge Mar 31 '21 at 16:25
  • 1
    Oops. I was using `dplyr` and received the error. But, now, I am using `tidyverse` and working correctly. After applying my main dataset, I will let you know. – 0Knowledge Mar 31 '21 at 16:27
  • 1
    @0Knowledge this is normal if you only use `dplyr`, either load `library(tidyverse)` or call the specific package `tidyr::separate()` – Dan Chaltiel Mar 31 '21 at 16:27
  • Thank you very much for your kind response and answer. – 0Knowledge Mar 31 '21 at 16:29
  • @0Knowledge you are welcome, it is nice to see good questions with reproducible data :-) if your problem is solved don't forget to accept an answer (mine or Damian's) – Dan Chaltiel Mar 31 '21 at 17:40
  • Your code is working but I have a question, why you used `nodeA nodeB` at the time of the `df_1` operation? But, my data frame contains `query and target` for `df_1`. – 0Knowledge Apr 01 '21 at 14:46
  • @0Knowledge oh this was a mistake! It only worked because your example is based on variables with the same name. You might want to use `dput()` to share them next time, it might be easier. I corrected my answer – Dan Chaltiel Apr 01 '21 at 19:18
0

There might be specialized commands for this, but a simple way to do it would be to create a single vector out of the columns using paste(), then use set operators to do your comparison logic.

Edit: updating the example to create a new column in the output as noted in the post, and using a function to encapsulate the paste command to minimize code duplication.

Example Data

A <- tibble(
    x = c(1, 0),
    y = c(2, 3)
)

B <- tibble(
    x = c(0, 1, 0, 1),
    y = c(1, 2, 3, 4)
) 

# function to create vector for comparisons
key <- function(data) {
    data %>% 
        mutate(key = paste(x, y, sep = ",")) %>% 
        pull(key)
}

Using it on the first data frame

A %>% 
    mutate(label = if_else(key(.) %in% key(B), "in common", "different"))

Result

A tibble: 2 x 3
      x     y label    
  <dbl> <dbl> <chr>    
1     1     2 in common
2     0     3 in common

Using it on the second data frame

B %>% 
    mutate(label = if_else(key(.) %in% key(A), "in common", "different"))

Result

# A tibble: 4 x 3
      x     y label    
  <dbl> <dbl> <chr>    
1     0     1 different
2     1     2 in common
3     0     3 in common
4     1     4 different
Damian
  • 1,385
  • 10
  • 10