How to generate new id by comparing two data frames

Question

I have two data frames. one is

PatientID   Name          DOB          SEX
1000017863  awssV       04-01-1936      F
1000017898  wrafdU      21-03-1971      M
1000017947  asfadfdV    29-04-1949      F
1000018029  dgdbcASK    28-12-1953      F
1000017898  wrafdU      21-03-1971      M
1000018164  adcv  K     22-05-1952      M
1000018181  asfvvR      12-06-1956      M

And one more is Empty table with column names

 ParetID  PatientID    Name       DOB      SEX

Now, I have to compare these two tables by matching name, sex and dob. if not matches then create new auto increment paretId by copying all other fields.

output is like

ParetID    PatientID    Name        DOB          SEX
001       1000017863    awssV      04-01-1936      F
002       1000017898    wrafdU     21-03-1971      M
003       1000017947    asfadfdV   29-04-1949      F
004       1000018029    dgdbcASK   28-12-1953      F
002       1000017898    wrafdU     21-03-1971      M

Your question doesn't make much sense to me. Can you show your desired output? Please see [how-to-make-a-great-r-reproducible-example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) — David Arenburg, Jun 17 '18 at 08:19
Seems like this'll be a candidate for `dplyr::anti_join` and/or `rbind(...)` followed by `cumsum(is.na(.))`. — r2evans, Jun 17 '18 at 09:04
You say *"if not matches"*. But one of the tables is empty. Can you show us a match? — Rui Barradas, Jun 17 '18 at 09:20
At the beginning it is empty so it will generate new paretid for all values. suppose same values comes again then it has keep only old paretid — Madhusudan, Jun 17 '18 at 09:26

score 1 · Accepted Answer · answered Jun 17 '18 at 10:23

Your outcome is a bit strange. I would create a parent data.frame and just add new records to that one, and copy the parentid to the other data. Not introduce duplicates in a parent data.frame. Below is something you could use.

step1: create parent data.frame from your initial data.frame (df1)

library(dplyr)

parents <- df1 %>%
  # remove dublicates.
  unique() %>% 
  mutate(ParentId = row_number())

   PatientID     Name        DOB SEX ParentId
1 1000017863    awssV 04-01-1936   F        1
2 1000017898   wrafdU 21-03-1971   M        2
3 1000017947 asfadfdV 29-04-1949   F        3
4 1000018029 dgdbcASK 28-12-1953   F        4
5 1000018164  adcv  K 22-05-1952   M        5
6 1000018181   asfvvR 12-06-1956   M        6

step2: adding new records to parent data.frame

parents <- df2 %>% 
  # remove dublicates
  unique() %>% 
  anti_join(parents) %>% 
  # add new rows on the bottom of parents
  bind_rows(parents, .) %>% 
  mutate(ParentId = ifelse(is.na(ParentId), row_number(), ParentId))

   PatientID     Name        DOB SEX ParentId
1 1000017863    awssV 04-01-1936   F        1
2 1000017898   wrafdU 21-03-1971   M        2
3 1000017947 asfadfdV 29-04-1949   F        3
4 1000018029 dgdbcASK 28-12-1953   F        4
5 1000018164  adcv  K 22-05-1952   M        5
6 1000018181   asfvvR 12-06-1956   M        6
7 1000020202     asdf 05-05-1966   F        7     #<<< new record

step3: add parentid to original data is simply using an inner_join.

df1 %>% inner_join(parents) 
Joining, by = c("PatientID", "Name", "DOB", "SEX")
   PatientID     Name        DOB SEX ParentId
1 1000017863    awssV 04-01-1936   F        1
2 1000017898   wrafdU 21-03-1971   M        2   #<<<< duplicate entries, same parentid.
3 1000017947 asfadfdV 29-04-1949   F        3
4 1000018029 dgdbcASK 28-12-1953   F        4
5 1000017898   wrafdU 21-03-1971   M        2   #<<<< duplicate entries, same parentid.
6 1000018164  adcv  K 22-05-1952   M        5
7 1000018181   asfvvR 12-06-1956   M        6

data:

df1 <- structure(list(PatientID = c(1000017863L, 1000017898L, 1000017947L, 
                             1000018029L, 1000017898L, 1000018164L, 1000018181L), 
               Name = c("awssV","wrafdU", "asfadfdV", "dgdbcASK", "wrafdU", "adcv  K", "asfvvR"),
               DOB = c("04-01-1936", "21-03-1971", "29-04-1949", "28-12-1953", 
                                        "21-03-1971", "22-05-1952", "12-06-1956"),
               SEX = c("F", "M", "F", "F", "M", "M", "M")), 
          class = "data.frame", row.names = c(NA, -7L))

df2 <- structure(list(PatientID = c(1000017863L, 1000017898L, 1000020202L), 
               Name = c("awssV", "wrafdU", "asdf"), 
               DOB = c("04-01-1936", "21-03-1971", "05-05-1966"), 
               SEX = c("F", "M", "F")), 
          class = "data.frame", row.names = c(NA, -3L))

instead of `anti_join` and `bind_rows`, can't you just `full_join` and backfill `ParetID`? — r2evans, Jun 17 '18 at 10:25
Your code is helpful but Is there any way to that using loop function — Madhusudan, Jun 17 '18 at 10:34
@r2evans, yes `full_join` would work, but with `anti_join` and `bind_rows` I determine where the new rows appear. `full_join` doesn't add the new rows at the end. — phiver, Jun 17 '18 at 10:38
@Madhusudan, why would you want to use a loop? It is not necessary and makes it more complicated — phiver, Jun 17 '18 at 10:41
@phiver, true, but don't the `NA`s in `ParetID` (or `ParentID`) give away the new rows? A full join would have the same effect. (They both work, I think, just wondering without testing if there's a logical or practical difference between them.) — r2evans, Jun 17 '18 at 10:51
@phiver, I have to get answer only through loop statements so please help me. — Madhusudan, Jun 17 '18 at 10:59
@r2evans, yes arranging on parentid and then filling in the `NA` would work as well. Just ran a small benchmark and the difference of the mean times between the versions is about 0.7 milliseconds. with the full join being slightly slower on all metrics. But not by much. So both options would in practice be about the same speed. — phiver, Jun 17 '18 at 11:02

How to generate new id by comparing two data frames

1 Answers1