Number of Matches Between Two Comma Separated Factors in a Data Frame

Question

I have a dataframe that looks something like this:

Row    ID1    ID2    Colors1                Colors2
1      1      2      Green, Blue, Purple    Green, Purple
2      1      3      Green, Orange          Orange, Red

I would like to compute, for each row, the number of colors that Colors1 and Colors2 have in common. The desired result is the following:

Row    ID1    ID2    Colors1                Colors2         Common 
1      1      2      Green, Blue, Purple    Green, Purple   2     #Green, Purple
2      1      3      Green, Orange          Orange, Red     1     #Orange

score 2 · Answer 1 · answered Mar 29 '14 at 04:47

An alternative approach is to turn each value in the first column into a regular expression and search for it in the corresponding value of the second column. The "stringi" package helps here because its functions are vectorized over both the strings and the patterns.

df <- structure(list(Colors1 = c("Green, Blue, Purple", "Green, Blue", 
"Green, Blue, Purple"), Colors2 = c("Green, Purple", "Green, Purple", 
"Orange, Red")), .Names = c("Colors1", "Colors2"), row.names = c("2", 
"21", "3"), class = "data.frame")

df
#                Colors1       Colors2
# 2  Green, Blue, Purple Green, Purple
# 21         Green, Blue Green, Purple
# 3  Green, Blue, Purple   Orange, Red

library(stringi)
stri_extract_all_regex(df$Colors2, gsub(", ", "|", df$Colors1))
# [[1]]
# [1] "Green"  "Purple"
# 
# [[2]]
# [1] "Green"
# 
# [[3]]
# [1] NA

stri_count_regex(df$Colors2, gsub(", ", "|", df$Colors1))
# [1] 2 1 0

Here, gsub converts each "Colors1" value into a regular-expression alternation, "Green|Blue|Purple" instead of "Green, Blue, Purple", and that pattern is then passed to each of the "stringi" functions shown above.
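
To make the intermediate step visible, here is a minimal base R sketch of the patterns that gsub produces for the example data. The "DarkGreen" value is hypothetical and not part of the original data; it is only there to illustrate that an alternation matches substrings, which may or may not matter for your colors.

```r
# Same Colors1 values as the example data frame above
colors1 <- c("Green, Blue, Purple", "Green, Blue", "Green, Blue, Purple")

# Replace each ", " separator with the regex alternation operator "|"
patterns <- gsub(", ", "|", colors1)
patterns
# [1] "Green|Blue|Purple" "Green|Blue"        "Green|Blue|Purple"

# Caveat: an alternation matches substrings, so a hypothetical value
# such as "DarkGreen" would also count as a match for "Green"
substring_hit <- grepl(patterns[2], "DarkGreen")
substring_hit
# [1] TRUE
```

If partial matches are a concern, splitting the strings and comparing whole elements, as in the other answer, avoids the issue.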

You could use some `stri_replace` function :) – bartektartanus Apr 08 '14 at 19:39

score 1 · Accepted Answer · answered Mar 28 '14 at 23:48

You can use:

col1 <- strsplit(df$Colors1, ", ")
col2 <- strsplit(df$Colors2, ", ")
df$Common <- sapply(seq_len(nrow(df)), function(x) length(intersect(col1[[x]], col2[[x]])))

Example

df <- data.frame(Colors1 = c('Green, Blue', 'Green, Blue, Purple'), Colors2 = c('Green, Purple', 'Orange, Red'), stringsAsFactors = FALSE)
col1 <- strsplit(df$Colors1, ", ")
col2 <- strsplit(df$Colors2, ", ")
df$Common <- sapply(seq_len(nrow(df)), function(x) length(intersect(col1[[x]], col2[[x]])))
df
#               Colors1       Colors2 Common
# 1         Green, Blue Green, Purple      1
# 2 Green, Blue, Purple   Orange, Red      0

Robert Krzyzanowski


Thank you, it worked. The sapply statement is difficult for me to understand - any further explanation would be appreciated. – user2980491 Mar 29 '14 at 01:04
More specifically, why is sapply required? Why doesn't the code length(intersect(col1, col2)) work? – user2980491 Apr 01 '14 at 17:30
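
On the question in the comments above, a minimal sketch of why the per-row loop is needed, assuming the same example data as in the answer. strsplit returns a list with one character vector per row, so calling intersect on the two lists compares whole rows rather than the colors inside them. The variable names whole, per_row and per_row_mapply are only illustrative.

```r
df <- data.frame(Colors1 = c("Green, Blue", "Green, Blue, Purple"),
                 Colors2 = c("Green, Purple", "Orange, Red"),
                 stringsAsFactors = FALSE)

# Each result is a list: one character vector of colors per row
col1 <- strsplit(df$Colors1, ", ")
col2 <- strsplit(df$Colors2, ", ")

# intersect() on the two lists compares entire list elements (whole rows),
# so this gives one number for the whole data frame, not one per row
whole <- length(intersect(col1, col2))

# sapply() walks the row indices and intersects one pair of rows at a time
per_row <- sapply(seq_len(nrow(df)),
                  function(i) length(intersect(col1[[i]], col2[[i]])))
per_row
# [1] 1 0

# mapply() expresses the same pairwise idea without explicit indices
per_row_mapply <- mapply(function(a, b) length(intersect(a, b)), col1, col2)
```

Both sapply and mapply are just ways of looping over the rows; which one to use is a matter of taste.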
