How to automatize comparison of values in two columns of two different data.frames in r?

Question

I'm trying to automatize the following function for all columns of two data frames.

I have the data.frames position_sc and position_hit. Both have 18 columns (each column represents a subject ("PD.."). The rows represent the sentence numbers (sentence comprehension test with four conditions: sc, sf, lc, lf, total number of sentences 200) In position_sc the numbers in the rows represent the sentence numbers which had the condition sc (e.g. 3, 7, 19 --> these sentences were presented under condition sc) In position_hit the rows represent for which sentence of the 200 the subjects had a hit (e.g. 3, 22, 78 --> the subject gave the correct answer for these sentences). Now I want to get the sentences the subjects had correct (hit) in condition sc. So the sentence number are the same when they had a sc-sentence correct.

I want to have the number of equal values in position_hit$x and position_sc$x. I was able to do that for each column (=subject) seperately, but I can't figure out, how to automatize it over all columns (compare position_sc$PD02 with position_hit$PD02, position_sc$PD03 with position_hit$PD03, position_sc$PD04 with position_hit$PD04, position_sc$PD05 with position_hit$PD03, aso.

That works well:

test_single = length(intersect(position_sc$PD03, position_hit$PD03))

OUTPUT
[1] 40

--> (40 numbers are equal --> 40 sentences in condition sc were a hit)

I tried something like that:

#Create a list containing both data.frames.
list_position_sc_hit = list(position_hit, position_sc)

#Try to automatize intersect() first:
test_all = sapply(list_positions_sc_hit, function(x) intersect(position_sc$x, position_hit$x))

But OUTPUT:
$hit
NULL

$sc
NULL

Here is an example how my script looks like: REPRODUCIBLE EXAMPLE

#CREATE 2 DATAFRAMES
#Dataframe 1
df.sc = data.frame (PD02 = c(1,3,7,9,12),
                  PD03 = c(8,9,10,11,13),
                  PD04 = c(1,2,3,24,36),
                  PD05 = c(1,9,20,23,26))
df.sc

#Dataframe 2
#Create list of 4
list.hit = list(PD02 = c(1,3,7,8,11,13,14,15,16,17),
                  PD03 = c(2,4,6,10,11,13,18,19,21),
                  PD04 = c(1,2,3,4,6),
                  PD05 = c(18,19,20,23,25,26))
#Convert list with different number of rows into data.frame
df.hit = as.data.frame(t(stri_list2matrix(list.hit)))
#Rows and columns are switched
df.hit = data.frame(t(df.hit))
#Name columns like in df.sc
names(df.hit)[1:4] = names(df.sc)[1:4]
#Convert all values in data.frame from chr to int
df.hit = sapply(df.hit, as.integer)
# After converting from chr to int the data.frame became an array. In order to reshape it I had to switch row and column twice (not sure why)
df.hit = data.frame(t(df.hit))
df.hit = data.frame(t(df.hit))
df.hit

#Do intersection and length counting for one subject (column)
matches.hit_sc_PD02 = length(intersect(df.sc$PD02, df.hit$PD02))
matches.hit_sc_PD02

OUTPUT: [3] --> three integers are the same in df.sc$PD02 and df.hit$PD02

#Combine both dataframes into list and rename dataframes
df.hit_sc = list(df.hit, df.sc)
names(df.hit_sc) = c("hit", "sc")
df.hit_sc

#Just try intersect() function for all variables (columns) with sapply(). I left length() out just to minimize complexity. 
matches.hit_sc_all = sapply(df.hit_sc, function(x) intersect(df.sc$x, df.hit$x))
matches.hit_sc_all

OUTPUT: $hit NULL

$sc NULL

I would be happy if anyone could help me.

Welcome to Stack Overflow! You should provide a [minimal reproducible example of your code and data along with relevant errors](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Please [edit] your question to include a [mre] so that readers can run your code to answer your question. — Maël, Aug 03 '23 at 08:26

evanpmorier · Accepted Answer · 2023-08-03T14:36:40.330

If I'm understanding the question correctly, you want to test how many values are shared between two columns of the same name from two different data frames. Here is a simple solution:

# assert that variable names are identical and in the same order
stopifnot(names(df.sc) == names(df.hit))

map2(df.sc, df.hit, ~ length(intersect(.x, .y)))

This will return a list with one element per variable name:

$PD02
[1] 3

$PD03
[1] 3

$PD04
[1] 3

$PD05
[1] 3

If you want to return a tibble (data.frame), which may be easier to work with, you can use map2_dfc() instead:

map2_dfc(df.sc, df.hit, ~ length(intersect(.x, .y)))

which returns:

# A tibble: 1 x 4
   PD02  PD03  PD04  PD05
  <int> <int> <int> <int>
1     3     3     3     3

If this solves your problem I'd appreciate an upvote and accepting the solution. Thanks and hope this helps!

Edit: the map* functions are in the purrr package from the tidyverse.

Amazing! It worked. Thank you so much. I was thinking the whole time that I sapply() is the solution. — Tena, Aug 04 '23 at 05:37
No problem! I think the base R equivalent that can work on multiple inputs is mapply(). But I usually use the tidyverse version of functions to keep my code consistent (which is why I suggested map2) — evanpmorier, Aug 24 '23 at 15:48

How to automatize comparison of values in two columns of two different data.frames in r?

1 Answers1