Compare every n observations with the ith set of observations in 2 data frames in R

Question

I want to compare 2 data frames. One has 400k observations, the other 100k. I want to compare each observation in the shorter one with the corresponding set of 4 in the larger one, in sequence. In other words, the 1st observation in b (the shorter DF) with the first 4 observations in a (the larger DF), the second in b with the second set of 4 in a, and so on. I'd like to count the number of times there's a match.

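# a: the larger data frame (400k rows); b: the shorter one (100k rows)
c = 0   # running count of matches
d = 1   # first row of the current set of 4 in a
e = 4   # last row of the current set of 4 in a

for (x in b[, 1]) {
    # does this value of b appear anywhere in its set of 4 rows of a?
    if (x %in% a[d:e, 1]) {
        c = c + 1
    }
    # move on to the next set of 4 rows in a
    d = d + 4
    e = e + 4
}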

score 0 · Answer 1 · answered Jan 29 '18 at 22:46

I tried to tackle your question below, but it was difficult because the question is somewhat vague. For guidance on how to write a good question, see "How to make a great R reproducible example?".

I hope this code helps you get on the right track!

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# create two data frames, with specified dimensions
set.seed(123)
a_large_df <- data.frame(sample_a = sample(1:100, 400, TRUE))
head(a_large_df)
#>   sample_a
#> 1       29
#> 2       79
#> 3       41
#> 4       89
#> 5       95
#> 6        5
b_small_df <- data.frame(sample_b = sample(1:100, 100, TRUE))
head(b_small_df)
#>   sample_b
#> 1       99
#> 2       14
#> 3       91
#> 4       58
#> 5       40
#> 6       45

# create a group index column every 4 rows
a_large_df <- a_large_df %>%
  mutate(group_of_4_index = (seq(nrow(a_large_df))-1) %/%4)
head(a_large_df)
#>   sample_a group_of_4_index
#> 1       29                0
#> 2       79                0
#> 3       41                0
#> 4       89                0
#> 5       95                1
#> 6        5                1

# create an index column every row starting from 0 to match above
b_small_df <- b_small_df %>%
  mutate(group_of_4_index = seq(nrow(b_small_df))-1)
head(b_small_df)
#>   sample_b group_of_4_index
#> 1       99                0
#> 2       14                1
#> 3       91                2
#> 4       58                3
#> 5       40                4
#> 6       45                5

# combine the two dataframes by the index
a_b_df <- left_join(a_large_df, b_small_df, by = "group_of_4_index")
head(a_b_df)
#>   sample_a group_of_4_index sample_b
#> 1       29                0       99
#> 2       79                0       99
#> 3       41                0       99
#> 4       89                0       99
#> 5       95                1       14
#> 6        5                1       14

# check if the values of the samples match per group, and if so mark "yes" 
a_b_df <- a_b_df %>%
  group_by(group_of_4_index) %>%
  mutate(match = if_else(sample_a %in% sample_b, "yes", "no"))
head(a_b_df)
#> # A tibble: 6 x 4
#> # Groups:   group_of_4_index [2]
#>   sample_a group_of_4_index sample_b match
#>      <int>            <dbl>    <int> <chr>
#> 1       29                0       99    no
#> 2       79                0       99    no
#> 3       41                0       99    no
#> 4       89                0       99    no
#> 5       95                1       14    no
#> 6        5                1       14    no

table(a_b_df$match)
#> 
#>  no yes 
#> 392   8
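
Note that the table above counts matching rows of a, not observations in b. If the goal is one count per observation in b, a base R sketch along the following lines may be simpler, and it should scale to the 400k/100k sizes in the question. It assumes the values sit in the first column of each data frame and that a has exactly four rows for every row of b. `count_block_matches` is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper: counts how many rows of `b` have their value
# somewhere in the corresponding block of `block_size` rows of `a`.
count_block_matches <- function(a, b, block_size = 4) {
  a_vals <- a[[1]]
  b_vals <- b[[1]]
  # Assumption: a has exactly block_size rows per row of b
  stopifnot(length(a_vals) == block_size * length(b_vals))
  # One matrix row per block: row i holds the i-th set of 4 values from a
  blocks <- matrix(a_vals, ncol = block_size, byrow = TRUE)
  # b_vals is recycled down the columns, so row i is compared with b_vals[i]
  hits <- rowSums(blocks == b_vals) > 0
  sum(hits)
}

# Small made-up example: 3 observations in b, 12 in a
a <- data.frame(v = c(1, 2, 3, 4,  5, 6, 7, 8,  9, 9, 1, 2))
b <- data.frame(v = c(3, 10, 9))
count_block_matches(a, b)  # set 1 contains 3, set 3 contains 9
#> [1] 2
```

A set that contains the value more than once (like the third set here) is still counted once, because the comparison is reduced to one TRUE/FALSE per set before summing.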

How do I prevent repeats? I can't have any repeat numbers in each set of 4 — , Jan 30 '18 at 00:59
In the `sample` function, change `replace` to FALSE, e.g. `sample(1:100, 100, FALSE)` — Greg B, Jan 30 '18 at 01:02
I tried that. It throws an error. I need to produce n 1:10, 100000 — , Jan 30 '18 at 01:22
I am not exactly sure what you are looking for. Can you please be more specific? `rep(sample(1:10, 4, FALSE), 25000)` will produce a random sample of 4 numbers from 1:10 without replacement. That same sample is then repeated 25,000 times to produce a vector of total length 100,000. If you need additional help, please provide details of your actual code. Thanks! — Greg B, Jan 30 '18 at 16:42
I need 2 data frames, both with nonrepeating numbers from 1:10: 1) in unique sets of 4; the other is fine just the way you set it up. — , Jan 30 '18 at 21:20
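
One possible reading of this request is that every set of 4 should be drawn separately, with no repeated value inside a set. `rep()` repeats a single draw, whereas `replicate()` makes a fresh draw each time. A minimal sketch under that assumption follows; the names `n_sets`, `sets`, `a_vals` and `a_df` are illustrative only.

```r
set.seed(123)
n_sets <- 25000  # 25,000 sets of 4 = 100,000 values

# Each column of `sets` is an independent draw of 4 distinct values from 1:10
sets <- replicate(n_sets, sample(1:10, 4, replace = FALSE))

# Column-major flattening keeps each set of 4 contiguous in the vector
a_vals <- as.vector(sets)
a_df <- data.frame(sample_a = a_vals)

# Total number of values generated
length(a_vals)

# TRUE if no set of 4 contains a repeated value
all(apply(sets, 2, anyDuplicated) == 0)
```

Values still repeat across different sets, which is unavoidable when drawing 100,000 values from only 10 candidates.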
