You can't pass a vector to the pattern
argument of %like%
since it calls upon grepl/grep
and these aren't vectorized. You could use mapply
to call %like%
for each row to get what you want:
iris[mapply(function(x,y) x %like% y, Species, Species2) ]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species2
#1: 5.1 3.5 1.4 0.2 setosa set
#2: 4.9 3.6 1.4 0.1 setosa set|vers
#3: 5.9 3.0 5.1 1.8 virginica virginica
Microbenchmark mainly for my own curiosity, but for anyone else interested:
set.seed(1)
dt <- data.table(Species = replicate(100000, paste0(sample(LETTERS, 6), collapse = "")),
Species2 = replicate(100000, paste0(sample(LETTERS, 3), collapse = "")))
microbenchmark::microbenchmark( mapply = dt[mapply(function(x,y) x %like% y, Species, Species2) ],
by_group1 = dt[, .SD[Species[1L] %like% Species2[1L]], by = .(Species, Species2)],
by_group2 = dt[, .SD[Species %like% Species2], by = 1:nrow(dt)],
str_detect = dt[stri_detect_regex(Species, Species2)],
by_species2 = dt[,.SD[Species %like% Species2], by = Species2],
by_species2I = dt[dt[, .I[Species %like% Species2], by = Species2]$V1],
times = 5)
Unit: milliseconds
expr min lq mean median uq max neval
mapply 669.9691 680.2241 700.3758 685.8262 715.8373 750.0224 5
by_group1 10906.2179 10908.0985 10951.5651 10914.7002 11009.0683 11019.7408 5
by_group2 16738.4390 16826.4793 16907.8428 16902.9490 16970.6143 17100.7324 5
str_detect 430.7768 431.1002 432.2279 431.9284 433.3488 433.9855 5
by_species2 2482.7583 2518.6858 2547.5882 2531.4913 2599.0159 2605.9899 5
by_species2I 110.1486 114.6775 115.9223 117.5270 118.5033 118.7553 5
Only ran it 5 times since the by_group*
operations were so slow. Looks like @eddi's method using .I
is that fastest (assuming I have his intended method correct).
Also, re-ran the benchmark using fewer groups, it seems in this case the by_species2I
is still the fastest, and the other by_group*
are still slowest by a lot (makes sense since the # of groups for by_group2
is always the data size and for by_group1
it's going to be close to the data size).
set.seed(1)
dt <- data.table(Species = replicate(100000, paste0(sample(LETTERS, 3), collapse = "")),
Species2 = replicate(100000, paste0(sample(LETTERS, 2), collapse = "")))
Unit: milliseconds
expr min lq mean median uq max neval
mapply 611.83085 617.60180 639.7778 638.49061 652.80619 678.15932 5
by_group1 10021.48177 10121.00419 10145.6305 10123.01354 10213.37976 10249.27339 5
by_group2 15828.21224 15997.56034 16018.9583 16066.07284 16101.40961 16101.53651 5
str_detect 416.44549 419.83585 420.6042 421.69423 421.85359 423.19194 5
by_species2 106.06793 114.02764 115.5364 117.62331 118.04524 121.91770 5
by_species2I 14.22369 14.72001 15.2137 15.24514 15.38371 16.49597 5