I have 2 CSV files and want to find the rows they have in common. After reading them as data frames, I converted them to data.tables and merged them. But somehow my code is not working: after using setDT(), my dataset is changed and I am not getting any common rows between them!

Before running the code, my dataset:

     nodeA  nodeB           scr
1  ID08918 ID04896                 1
2  ID00402 ID01198                 1
3  ID00182 ID01576                 1
4  ID06413 ID00745                 1
5  ID00215 ID01175                 1
6  ID00448 ID05351                 1
7  ID00860 ID00959 0.996197718631179
8  ID01110 ID01127  0.99604743083004
9  ID00497 ID01192 0.995436766623207
10 ID00877 ID01590 0.993939393939394
11 ID01192 ID01183 0.992202729044834
12 ID00361 ID00570 0.988354430379747
13 ID01045 ID01201  0.98766954377312
14 ID11641 ID00541 0.986875315497224
15 ID11641 ID00570  0.98685540950455
16 ID00458 ID01151 0.986813186813187
17 ID00199 ID01211 0.981416957026713
18 ID00570 ID00309 0.981151299032094
19 ID00541 ID00309 0.978161503301168
20 ID00603 ID06789 0.977272727272727
library(dplyr)

df_1 <- read.csv("~/df_1.csv", stringsAsFactors = FALSE)
df_2 <- read.csv("~/df_2.csv", stringsAsFactors = FALSE)

library(data.table)
setDT(df_1)[,c("nodeA", "nodeB") := list(pmin(nodeA,nodeB), pmax(nodeA,nodeB))]
setDT(df_2)[,c("nodeA", "nodeB") := list(pmin(nodeA,nodeB), pmax(nodeA,nodeB))]
result <- merge(df_1[!duplicated(df_1),], df_2, allow.cartesian=TRUE)

After running the code, my dataset has changed:

      nodeA   nodeB               scr
 1: ID08918 ID08918                 1
 2: ID00402 ID00402                 1
 3: ID00182 ID00182                 1
 4: ID06413 ID06413                 1
 5: ID00215 ID00215                 1
 6: ID00448 ID00448                 1
 7: ID00860 ID00860 0.996197718631179
 8: ID01110 ID01110  0.99604743083004
 9: ID00497 ID00497 0.995436766623207
10: ID00877 ID00877 0.993939393939394
11: ID01192 ID01192 0.992202729044834
12: ID00361 ID00361 0.988354430379747
13: ID01045 ID01045  0.98766954377312
14: ID11641 ID11641 0.986875315497224
15: ID11641 ID11641  0.98685540950455
16: ID00458 ID00458 0.986813186813187
17: ID00199 ID00199 0.981416957026713
18: ID00570 ID00570 0.981151299032094
19: ID00541 ID00541 0.978161503301168
20: ID00603 ID00603 0.977272727272727

Reproducible Dataset df_1

structure(list(query = structure(c(18L, 5L, 1L, 17L, 3L, 6L, 
12L, 15L, 8L, 13L, 16L, 4L, 14L, 19L, 19L, 7L, 2L, 10L, 9L, 11L
), .Label = c("ID00182", "ID00199", "ID00215", "ID00361", "ID00402", 
"ID00448", "ID00458", "ID00497", "ID00541", "ID00570", "ID00603", 
"ID00860", "ID00877", "ID01045", "ID01110", "ID01192", "ID06413", 
"ID08918", "ID11641"), class = "factor"), target = structure(c(16L, 
11L, 14L, 4L, 8L, 17L, 5L, 6L, 10L, 15L, 9L, 3L, 12L, 2L, 3L, 
7L, 13L, 1L, 1L, 18L), .Label = c("ID00309", "ID00541", "ID00570", 
"ID00745", "ID00959", "ID01127", "ID01151", "ID01175", "ID01183", 
"ID01192", "ID01198", "ID01201", "ID01211", "ID01576", "ID01590", 
"ID04896", "ID05351", "ID06789"), class = "factor"), new_ssp = structure(c(15L, 
15L, 15L, 15L, 15L, 15L, 14L, 13L, 12L, 11L, 10L, 9L, 8L, 7L, 
6L, 5L, 4L, 3L, 2L, 1L), .Label = c("0.977272727272727", "0.978161503301168", 
"0.981151299032094", "0.981416957026713", "0.986813186813187", 
"0.98685540950455", "0.986875315497224", "0.98766954377312", 
"0.988354430379747", "0.992202729044834", "0.993939393939394", 
"0.995436766623207", "0.99604743083004", "0.996197718631179", 
"1"), class = "factor")), class = "data.frame", row.names = c(NA, 
-20L))

df_2

structure(list(nodeA = structure(c(4L, 2L, 1L, 1L, 1L, 4L, 1L, 
9L, 3L, 4L, 2L, 8L, 2L, 1L, 5L, 7L, 3L, 6L, 2L, 1L), .Label = c("ID00309", 
"ID00361", "ID00541", "ID00570", "ID00615", "ID00696", "ID00762", 
"ID01200", "ID05109"), class = "factor"), nodeB = structure(c(8L, 
3L, 3L, 1L, 2L, 7L, 9L, 8L, 8L, 6L, 9L, 7L, 4L, 4L, 6L, 9L, 6L, 
7L, 5L, 5L), .Label = c("ID00361", "ID00541", "ID00570", "ID00615", 
"ID00696", "ID01200", "ID05109", "ID11641", "ID11691"), class = "factor"), 
    scr = structure(20:1, .Label = c("1.85284606048794", "1.90444166064472", 
    "1.90762235378507", "1.94364188077133", "1.95883206119256", 
    "2.08440437841349", "2.26408172709962", "2.3223132020942", 
    "2.46120775935034", "2.49647215035727", "2.50432367561777", 
    "2.57541320006514", "2.65099330092281", "2.75209155741549", 
    "2.93717640337986", "2.99596628688011", "3.21209741517806", 
    "3.21997803385465", "3.48788394772132", "3.81389707587156"
    ), class = "factor")), class = "data.frame", row.names = c(NA, 
-20L))

Note: I am also using dplyr for some purposes, like %>% etc. Does that mean dplyr and data.table are conflicting somehow?

0Knowledge
  • I doubt it is the setDT that changes it. It would be the following command using `:=`, which does assignment by reference. This is expected behaviour. BTW, you should use `fread` to read the csv file; then there is no need to convert it using setDT. You don't need to use dplyr to get piping. Just use library(magrittr) for that, or use data.table piping with `[` – dww Feb 21 '21 at 15:13
  • I’m voting to close this question because this is expected behaviour of the `:=` operator (assignment by reference), which changes the data within the original data.table – dww Feb 21 '21 at 15:17
  • @dww thank you for your answer. BTW, I found this answer in this post [link](https://stackoverflow.com/questions/65547240/how-to-find-common-rows-considering-vice-versa-format-of-2-dataframe-in-r) and the code was working. – 0Knowledge Feb 21 '21 at 15:26
  • there's no reason it won't work. Your question was why it changes the original data. The reason is that you used `:=`, which is expected to have this behaviour. If you don't want to overwrite by reference, then don't use `:=`. Something like this instead: `result <- merge(df_1[, .(pmin(query,target), pmax(query,target))][, .SD[!duplicated(.SD)]], df_2[, .(pmin(nodeA,nodeB), pmax(nodeA,nodeB))], allow.cartesian=TRUE)` – dww Feb 21 '21 at 16:20
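(For reference, dww's non-destructive approach can be sketched end to end. The toy data below stands in for the two CSVs; `as.data.table()` returns a copy, and a `j = .(...)` expression builds a new table, so unlike `setDT()` + `:=` the original data frames are left untouched.)

```r
library(data.table)

# toy stand-ins for the two CSVs; read.csv(..., stringsAsFactors = FALSE)
# yields character columns, which pmin/pmax compare lexicographically
df_1 <- data.frame(query   = c("ID00361", "ID11641"),
                   target  = c("ID00570", "ID00570"),
                   new_ssp = c(0.988, 0.987),
                   stringsAsFactors = FALSE)
df_2 <- data.frame(nodeA = c("ID00570", "ID00309"),
                   nodeB = c("ID00361", "ID00570"),
                   scr   = c(3.48, 3.22),
                   stringsAsFactors = FALSE)

# normalise each pair to (smaller ID, larger ID) without touching the originals
a <- as.data.table(df_1)[, .(nodeA = pmin(query, target),
                             nodeB = pmax(query, target))]
b <- as.data.table(df_2)[, .(nodeA = pmin(nodeA, nodeB),
                             nodeB = pmax(nodeA, nodeB))]

# the node pairs common to both files, regardless of column order
result <- merge(unique(a), b, by = c("nodeA", "nodeB"))
print(result)

# the original df_1 is still a plain data.frame, not a data.table
stopifnot(!inherits(df_1, "data.table"))
```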

1 Answer


One possible solution with dplyr, using inner_join and union:

# inner join in both directions, combined via union
df_2 %>% 
  dplyr::inner_join(df_1, by = c("nodeA" = "query", "nodeB" = "target")) %>% 
  dplyr::mutate(GROUP = 1) %>% 
  dplyr::union(df_2 %>% 
                  dplyr::inner_join(df_1, by = c("nodeB" = "query", "nodeA" = "target")) %>% 
                  dplyr::mutate(GROUP = 2)) 

   nodeA   nodeB              scr           new_ssp GROUP
1 ID00361 ID00570 3.48788394772132 0.988354430379747     1
2 ID00570 ID11641 3.81389707587156  0.98685540950455     2
3 ID00309 ID00570 3.21997803385465 0.981151299032094     2
4 ID00309 ID00541 2.99596628688011 0.978161503301168     2
5 ID00541 ID11641 2.57541320006514 0.986875315497224     2
DPH
  • Thank you for your answer. But the problem is that the `join` function works only in one direction, and I need to match in both directions. For example, a row in df_1 is (ID1, ID2) and df_2 contains (ID2, ID1); then I want to count them as `matched`. I also do not want to use `new_ssp` or `scr` for matching, only `nodeA, nodeB` – 0Knowledge Feb 21 '21 at 15:34
  • @Akib62 you can delete the `"scr" = "new_ssp"`, and concerning the order, it is possible to invert the join and combine via union ... I will edit my answer accordingly – DPH Feb 21 '21 at 15:42
  • Very interesting and nice solution. But could you tell me the purpose of `mutate(GROUP = 2)`? I understand we are doing `mutate`, but what is the `2`? – 0Knowledge Feb 21 '21 at 16:28
  • Most probably, I am getting the wrong answer because of `union`. I do not want all of them; I just want the rows matched in either direction! After running your code, I am getting some rows that do not actually match between `df_1` and `df_2`. – 0Knowledge Feb 21 '21 at 16:42
  • the GROUP = just generates an id to know which join operation the match resulted from. I had a logical error in my answer and altered it (not my day today, apparently)... Possibly I am not fully understanding your problem? I am searching for duplicates between the two datasets, where the order of the matching columns can differ. This means one answer might show up more than once due to the distinct order of matching columns. If the values of new_ssp and scr are not influenced by the column order/dataset, you could use `unique()` from base or `dplyr::distinct` to reduce the result set further – DPH Feb 21 '21 at 19:57
  • so, do I have to apply `unique()` or `distinct` to handle the duplicates? We are already doing `group by`! – 0Knowledge Feb 22 '21 at 02:37
  • Half of my output is duplicates. For example, if the code outputs 10 rows, only 5 are actually distinct (the other 5 are duplicates)! – 0Knowledge Feb 22 '21 at 03:02
  • @Akib62 if "duplicates" refers only to the combination of nodeA and nodeB, then you can add an additional line to the pipe: `dplyr::distinct(nodeA, nodeB)`; the line before it has to end with `%>%` (as the others do)... optionally you could do `select(nodeA, nodeB) %>% unique()` – DPH Feb 22 '21 at 09:59
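(Putting the comment thread together, the whole pipe can be sketched as follows on toy data. `bind_rows` is substituted for `union` here so the `GROUP` tagging can be dropped, and `distinct(nodeA, nodeB)` de-duplicates on the node columns only, as requested above.)

```r
library(dplyr)

# toy stand-ins for the two data sets
df_1 <- data.frame(query   = c("ID00361", "ID11641"),
                   target  = c("ID00570", "ID00570"),
                   new_ssp = c(0.988, 0.987),
                   stringsAsFactors = FALSE)
df_2 <- data.frame(nodeA = c("ID00570", "ID00309"),
                   nodeB = c("ID00361", "ID00570"),
                   scr   = c(3.48, 3.22),
                   stringsAsFactors = FALSE)

# inner join in both directions, stack the matches, then keep each
# node pair once, matching on the node columns only
matched <- bind_rows(
  inner_join(df_2, df_1, by = c("nodeA" = "query", "nodeB" = "target")),
  inner_join(df_2, df_1, by = c("nodeB" = "query", "nodeA" = "target"))
) %>%
  distinct(nodeA, nodeB)

print(matched)
```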