I have 2 CSV files. Now I want to find the common rows of these 2 files. So, after reading them as dataframe
I converted them as datatable
and then merge
them. But, somehow, my code is not working. After using setDT()
my dataset is changed and I am not getting any common rows between them!
Before running my dataset
nodeA nodeB scr
1 ID08918 ID04896 1
2 ID00402 ID01198 1
3 ID00182 ID01576 1
4 ID06413 ID00745 1
5 ID00215 ID01175 1
6 ID00448 ID05351 1
7 ID00860 ID00959 0.996197718631179
8 ID01110 ID01127 0.99604743083004
9 ID00497 ID01192 0.995436766623207
10 ID00877 ID01590 0.993939393939394
11 ID01192 ID01183 0.992202729044834
12 ID00361 ID00570 0.988354430379747
13 ID01045 ID01201 0.98766954377312
14 ID11641 ID00541 0.986875315497224
15 ID11641 ID00570 0.98685540950455
16 ID00458 ID01151 0.986813186813187
17 ID00199 ID01211 0.981416957026713
18 ID00570 ID00309 0.981151299032094
19 ID00541 ID00309 0.978161503301168
20 ID00603 ID06789 0.977272727272727
library(dplyr)
df_1 <- read.csv("~/df_1.csv", stringsAsFactors = FALSE)
df_2 <- read.csv("~/df_2.csv", stringsAsFactors = FALSE)
library(data.table)
setDT(df_1)[,c("nodeA", "nodeB") := list(pmin(nodeA,nodeB), pmax(nodeA,nodeB))]
setDT(df_2)[,c("nodeA", "nodeB") := list(pmin(nodeA,nodeB), pmax(nodeA,nodeB))]
result <- merge(df_1[!duplicated(df_1),], df_2, allow.cartesian=TRUE)
After running the code my dataset is changed.
nodeA nodeB scr
1: ID08918 ID08918 1
2: ID00402 ID00402 1
3: ID00182 ID00182 1
4: ID06413 ID06413 1
5: ID00215 ID00215 1
6: ID00448 ID00448 1
7: ID00860 ID00860 0.996197718631179
8: ID01110 ID01110 0.99604743083004
9: ID00497 ID00497 0.995436766623207
10: ID00877 ID00877 0.993939393939394
11: ID01192 ID01192 0.992202729044834
12: ID00361 ID00361 0.988354430379747
13: ID01045 ID01045 0.98766954377312
14: ID11641 ID11641 0.986875315497224
15: ID11641 ID11641 0.98685540950455
16: ID00458 ID00458 0.986813186813187
17: ID00199 ID00199 0.981416957026713
18: ID00570 ID00570 0.981151299032094
19: ID00541 ID00541 0.978161503301168
20: ID00603 ID00603 0.977272727272727
Reproducible Dataset
df_1
structure(list(query = structure(c(18L, 5L, 1L, 17L, 3L, 6L,
12L, 15L, 8L, 13L, 16L, 4L, 14L, 19L, 19L, 7L, 2L, 10L, 9L, 11L
), .Label = c("ID00182", "ID00199", "ID00215", "ID00361", "ID00402",
"ID00448", "ID00458", "ID00497", "ID00541", "ID00570", "ID00603",
"ID00860", "ID00877", "ID01045", "ID01110", "ID01192", "ID06413",
"ID08918", "ID11641"), class = "factor"), target = structure(c(16L,
11L, 14L, 4L, 8L, 17L, 5L, 6L, 10L, 15L, 9L, 3L, 12L, 2L, 3L,
7L, 13L, 1L, 1L, 18L), .Label = c("ID00309", "ID00541", "ID00570",
"ID00745", "ID00959", "ID01127", "ID01151", "ID01175", "ID01183",
"ID01192", "ID01198", "ID01201", "ID01211", "ID01576", "ID01590",
"ID04896", "ID05351", "ID06789"), class = "factor"), new_ssp = structure(c(15L,
15L, 15L, 15L, 15L, 15L, 14L, 13L, 12L, 11L, 10L, 9L, 8L, 7L,
6L, 5L, 4L, 3L, 2L, 1L), .Label = c("0.977272727272727", "0.978161503301168",
"0.981151299032094", "0.981416957026713", "0.986813186813187",
"0.98685540950455", "0.986875315497224", "0.98766954377312",
"0.988354430379747", "0.992202729044834", "0.993939393939394",
"0.995436766623207", "0.99604743083004", "0.996197718631179",
"1"), class = "factor")), class = "data.frame", row.names = c(NA,
-20L))
df_2
structure(list(nodeA = structure(c(4L, 2L, 1L, 1L, 1L, 4L, 1L,
9L, 3L, 4L, 2L, 8L, 2L, 1L, 5L, 7L, 3L, 6L, 2L, 1L), .Label = c("ID00309",
"ID00361", "ID00541", "ID00570", "ID00615", "ID00696", "ID00762",
"ID01200", "ID05109"), class = "factor"), nodeB = structure(c(8L,
3L, 3L, 1L, 2L, 7L, 9L, 8L, 8L, 6L, 9L, 7L, 4L, 4L, 6L, 9L, 6L,
7L, 5L, 5L), .Label = c("ID00361", "ID00541", "ID00570", "ID00615",
"ID00696", "ID01200", "ID05109", "ID11641", "ID11691"), class = "factor"),
scr = structure(20:1, .Label = c("1.85284606048794", "1.90444166064472",
"1.90762235378507", "1.94364188077133", "1.95883206119256",
"2.08440437841349", "2.26408172709962", "2.3223132020942",
"2.46120775935034", "2.49647215035727", "2.50432367561777",
"2.57541320006514", "2.65099330092281", "2.75209155741549",
"2.93717640337986", "2.99596628688011", "3.21209741517806",
"3.21997803385465", "3.48788394772132", "3.81389707587156"
), class = "factor")), class = "data.frame", row.names = c(NA,
-20L))
Note: I am also using dplyr
for some purposes like %>%
etc. Does it mean, dplyr and data.table
is conflicting somehow?