Hi everyone I have the datasets below and i would like to identify each row from each of the two sets in R.
I have tried sorting each in ascending order then appending row indices but still in vain.
As you can see i am trying to detect records marked red on the RHS as missing in the LHS.
Anyone with clue or link to a tutorial I will really appreciate.
The problem looks outrageous but that's what am dealing with currently
This is a simple reprex output.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df1 = data.frame(x = c(2000,3000,3000,48400,2000))
df2 = data.frame(y = c(2000,2000,2000,3000,3000,3000,41200,48400))
df1 = df1 %>% arrange(x) %>% group_by(x) %>% mutate(rownum = row_number()) %>%
mutate(uniqueid = paste0(x,"_",rownum))
df2 = df2 %>% arrange(y) %>% group_by(y) %>% mutate(rownum = row_number()) %>%
mutate(uniqueid = paste0(y,"_",rownum))
left_join(df2,df1)
#> Joining, by = c("rownum", "uniqueid")
#> # A tibble: 8 x 4
#> # Groups: y [4]
#> y rownum uniqueid x
#> <dbl> <int> <chr> <dbl>
#> 1 2000 1 2000_1 2000
#> 2 2000 2 2000_2 2000
#> 3 2000 3 2000_3 NA
#> 4 3000 1 3000_1 3000
#> 5 3000 2 3000_2 3000
#> 6 3000 3 3000_3 NA
#> 7 41200 1 41200_1 NA
#> 8 48400 1 48400_1 48400
df2
#> # A tibble: 8 x 3
#> # Groups: y [4]
#> y rownum uniqueid
#> <dbl> <int> <chr>
#> 1 2000 1 2000_1
#> 2 2000 2 2000_2
#> 3 2000 3 2000_3
#> 4 3000 1 3000_1
#> 5 3000 2 3000_2
#> 6 3000 3 3000_3
#> 7 41200 1 41200_1
#> 8 48400 1 48400_1
The trickiest part here is that the data sets should be arranged as they appear in the picture since each row in each table refer to a specific transaction.
My output above flags the third 2000 value and not the second on as it is supposed to.