merge based on an id with missing values and string

Question

My data frame is shown below:

mydf<- structure(list(IDs = c(11L, 16L, 19L, 21L, 22L, 24L, 42L, 43L, 
51L), string1 = structure(c(1L, 8L, 7L, 2L, 4L, 9L, 6L, 3L, 5L
), .Label = c("b", "g", "hue", "hyu", "if", "jud", "ufhy", "uhgf;ffugf", 
"uhgs"), class = "factor"), IDs.1 = c(4L, 11L, 16L, 19L, 20L, 
22L, 29L, NA, NA), string2 = structure(c(2L, 3L, 8L, 7L, 4L, 
5L, 6L, 1L, 1L), .Label = c("", "a", "b", "higf;hdugd", "hyu", 
"inja", "ufhy", "uhgf;ffugf"), class = "factor")), .Names = c("IDs", 
"string1", "IDs.1", "string2"), class = "data.frame", row.names = c(NA, 
-9L))

I want to combine the two ID/string pairs into a single pair of columns, like this:

myout<- structure(list(Ids = c(4L, 11L, 16L, 19L, 20L, 21L, 22L, 24L, 
29L, 42L, 43L, 51L), string = structure(c(1L, 2L, 11L, 10L, 4L, 
3L, 6L, 12L, 8L, 9L, 5L, 7L), .Label = c("a", "b", "g", "higf;hdugd", 
"hue", "hyu", "if", "inja", "jud", "ufhy", "uhgf;ffugf", "uhgs"
), class = "factor")), .Names = c("Ids", "string"), class = "data.frame", row.names = c(NA, 
-12L))

I tried to do it using `merge`:

df1 <- mydf[,1:2] 
df2 <- mydf[,3:4]
df3 = merge(df1, df2, by.x=c("IDs", "string"))

This gives me an error because the two data frames are unequal.

I also tried the approach given in "How to join (merge) data frames (inner, outer, left, right)?", which did not solve my problem.

My input looks like this:

IDs string1        IDs  string2
11  b              4    a
16  uhgf;ffugf     11   b
19  ufhy           16   uhgf;ffugf
21  g              19   ufhy
22  hyu            20   higf;hdugd
24  uhgs           22   hyu
42  jud            29   inja
43  hue     
51  if

and the output should look like this:

Ids string
4   a
11  b
16  uhgf;ffugf
19  ufhy
20  higf;hdugd
21  g
22  hyu
24  uhgs
29  inja
42  jud
43  hue
51  if

For example, 11, 16, etc. appear in both ID columns, so we only want them once.

In your `mydf`, 11 has a matching string in both 'a' and 'b', so why is '11 a' left out of `myout`? – akrun Dec 11 '16 at 09:47
@akrun There should be only one of them because they are the same. They should appear only once, not twice. I added a visualisation above. – nik Dec 11 '16 at 09:49

akrun · Accepted Answer · 2016-12-11T10:02:26.347

2

We can do an `rbind` and remove the duplicated elements:

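library(data.table)
# Stack the (IDs.1, string2) pair on top of the (IDs, string1) pair; rbindlist()
# binds by position and keeps the column names of the first element.
# Then drop rows with a missing ID, keep the first row per ID, rename, and sort.
setnames(rbindlist(list(mydf[3:4], mydf[1:2]))[!is.na(IDs.1) & !duplicated(IDs.1)],
         c("Ids", "string"))[order(Ids)]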
#    Ids     string
# 1:   4          a
# 2:  11          b
# 3:  16 uhgf;ffugf
# 4:  19       ufhy
# 5:  20 higf;hdugd
# 6:  21          g
# 7:  22        hyu
# 8:  24       uhgs
# 9:  29       inja
#10:  42        jud
#11:  43        hue
#12:  51         if
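
The same bind-then-deduplicate idea can also be sketched in base R, without data.table. This is only an illustration under a few assumptions: the data is rebuilt here with character columns instead of factors (so the two string columns can be stacked without worrying about differing factor levels), and the names `first`, `second`, and `stacked` are made up for the example.

```r
# Rebuild the question's data with character strings instead of factors
mydf <- data.frame(
  IDs     = c(11L, 16L, 19L, 21L, 22L, 24L, 42L, 43L, 51L),
  string1 = c("b", "uhgf;ffugf", "ufhy", "g", "hyu", "uhgs", "jud", "hue", "if"),
  IDs.1   = c(4L, 11L, 16L, 19L, 20L, 22L, 29L, NA, NA),
  string2 = c("a", "b", "uhgf;ffugf", "ufhy", "higf;hdugd", "hyu", "inja", "", ""),
  stringsAsFactors = FALSE
)

# Give both ID/string pairs the same column names so rbind() can stack them
first   <- setNames(mydf[, c("IDs.1", "string2")], c("Ids", "string"))
second  <- setNames(mydf[, c("IDs", "string1")], c("Ids", "string"))
stacked <- rbind(first, second)

# Drop rows without an ID, keep the first occurrence of each ID, sort by ID
myout <- stacked[!is.na(stacked$Ids) & !duplicated(stacked$Ids), ]
myout <- myout[order(myout$Ids), ]
rownames(myout) <- NULL
myout
```

Because `duplicated()` keeps the first occurrence, the string taken for an ID that appears in both pairs comes from whichever pair is stacked first; in this data the strings agree, so the order does not change the result.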

Or another option is melt from data.table (to convert to 'long' format) which can take multiple measure patterns, then remove the duplicated 'Ids' and order using 'Ids'.

melt(setDT(mydf), measure = patterns("ID", "string"), na.rm=TRUE, 
     value.name = c("Ids", "string"))[!duplicated(Ids, fromLast=TRUE)
        ][, variable := NULL][order(Ids)]

edited Dec 11 '16 at 10:02

answered Dec 11 '16 at 09:54

akrun


One problem is that they are not sorted; look at, for example, 20, 22, 29, 21, etc. – nik Dec 11 '16 at 09:57
I liked and accepted it. Is it possible to also give me an idea of how to find how many IDs are common to both ID columns, and how many and which ones differ? – nik Dec 11 '16 at 10:01
@nik Perhaps `length(Reduce(intersect, mydf[grep("ID", names(mydf))])) #[1] 4; nrow(mydf)-length(Reduce(intersect, mydf[grep("ID", names(mydf))])) #[1] 5` – akrun Dec 11 '16 at 10:04
@nik I used it on the data.frame try `setDF(mydf)` and then do it – akrun Dec 11 '16 at 10:08
You are right, it works, but it only gives me a number. Is it possible to know which ones too? Many thanks – nik Dec 11 '16 at 10:09
@nik Just remove the `length(` wrapper and you will find the elements that are common – akrun Dec 11 '16 at 10:10
Thank you so much! One last question and I will leave you alone :-D `nrow(mydf)-Reduce(intersect, mydf[grep("ID", names(mydf))])` does not work. Also, please look at my previous question. I will definitely like and accept your answer – nik Dec 11 '16 at 10:13
@nik `Reduce(intersect, ...)` returns the vector of common IDs, so subtracting it from the single number `nrow(mydf)` gives one value per ID rather than a count. In this case, you need `length(Reduce(...` – akrun Dec 11 '16 at 10:14
@nik There is an answer posted on that question and it seems fine to me – akrun Dec 11 '16 at 10:15
Thanks, but then I cannot know which of the first IDs differ from the others. When I keep the `length(Reduce(...` it only gives me one value: how many are different. – nik Dec 11 '16 at 10:19
@nik You just create an object `v1 <- Reduce(intersect, mydf[grep("ID", names(mydf))]); length(v1)` so, here you can get both the names and length. Also, finding which ones are different `v2 <- Reduce(setdiff, mydf[grep("ID", names(mydf))])` – akrun Dec 11 '16 at 10:46
Thanks. As for the other question, the answer there does not work. I posted another data set to check the algorithm, and it is not working right. – nik Dec 11 '16 at 11:04
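
The comparison discussed in the comments above can be collected into one small sketch. It assumes the two ID columns from the question, copied here as plain vectors so the block runs on its own; the names `ids1`, `ids2`, `common`, `only_first`, and `only_second` are made up for illustration, and dropping the `NA` padding from the second column first is an added step, not something the comments do.

```r
ids1 <- c(11L, 16L, 19L, 21L, 22L, 24L, 42L, 43L, 51L)  # mydf$IDs
ids2 <- c(4L, 11L, 16L, 19L, 20L, 22L, 29L, NA, NA)     # mydf$IDs.1

# Remove the NA padding so it is never reported as an ID
ids2 <- ids2[!is.na(ids2)]

common      <- intersect(ids1, ids2)  # IDs present in both columns: 11 16 19 22
only_first  <- setdiff(ids1, ids2)    # IDs only in the first column: 21 24 42 43 51
only_second <- setdiff(ids2, ids1)    # IDs only in the second column: 4 20 29

# Counts, if only the numbers are needed
length(common)       # 4
length(only_first)   # 5
length(only_second)  # 3
```

Note that `setdiff()` is not symmetric, so both directions are computed to see which IDs are unique to each column.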
