Merging two datasets with no unique columns in R

Question

Hi everyone I have the datasets below and i would like to identify each row from each of the two sets in R.

I have tried sorting each in ascending order then appending row indices but still in vain.

As you can see i am trying to detect records marked red on the RHS as missing in the LHS.

Anyone with clue or link to a tutorial I will really appreciate.

The problem looks outrageous but that's what am dealing with currently

This is a simple reprex output.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df1 = data.frame(x = c(2000,3000,3000,48400,2000))
df2 = data.frame(y = c(2000,2000,2000,3000,3000,3000,41200,48400))
df1 = df1 %>% arrange(x) %>% group_by(x) %>% mutate(rownum = row_number()) %>% 
  mutate(uniqueid = paste0(x,"_",rownum))
df2 = df2 %>% arrange(y) %>% group_by(y) %>% mutate(rownum = row_number()) %>% 
  mutate(uniqueid = paste0(y,"_",rownum))
left_join(df2,df1)
#> Joining, by = c("rownum", "uniqueid")
#> # A tibble: 8 x 4
#> # Groups:   y [4]
#>       y rownum uniqueid     x
#>   <dbl>  <int> <chr>    <dbl>
#> 1  2000      1 2000_1    2000
#> 2  2000      2 2000_2    2000
#> 3  2000      3 2000_3      NA
#> 4  3000      1 3000_1    3000
#> 5  3000      2 3000_2    3000
#> 6  3000      3 3000_3      NA
#> 7 41200      1 41200_1     NA
#> 8 48400      1 48400_1  48400
df2
#> # A tibble: 8 x 3
#> # Groups:   y [4]
#>       y rownum uniqueid
#>   <dbl>  <int> <chr>   
#> 1  2000      1 2000_1  
#> 2  2000      2 2000_2  
#> 3  2000      3 2000_3  
#> 4  3000      1 3000_1  
#> 5  3000      2 3000_2  
#> 6  3000      3 3000_3  
#> 7 41200      1 41200_1 
#> 8 48400      1 48400_1

The trickiest part here is that the data sets should be arranged as they appear in the picture since each row in each table refer to a specific transaction.

My output above flags the third 2000 value and not the second on as it is supposed to.

Please include a minimal reproducible example (https://stackoverflow.com/help/minimal-reproducible-example). Include example data that can be copied (e.g. the output of `dput`). Do not just include an image of your data. — Marcus, Aug 21 '20 at 17:22
Does this answer your question? [How to join (merge) data frames (inner, outer, left, right)](https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right) — Tushar Lad, Aug 21 '20 at 17:51
@TusharLad ,thanks but the question there appears to have unique IDs,mine has no unique id but just the arrangements. — user322203, Aug 21 '20 at 18:11

score 0 · Answer 1 · answered Aug 21 '20 at 17:01

0

Good question! Start by giving the columns a shared name (e.g., 'index'). Then load the dplyr package and use the anti_join() function like so:

anti_join(df1, df2, by = 'index')

You can also use the following if you don't want to use dplyr:

df1[!df1$index %in% df2$index,]

answered Aug 21 '20 at 17:01

wcbrown

157
7

thanks for the suggestion but if u look closely the second 2000 in df1 actually should correspond to the third 2000 in df2 – user322203 Aug 21 '20 at 18:14

Merging two datasets with no unique columns in R

1 Answers1