I am working with a tbl_spark in sparklyr.
I have a spark Dataframe with two list-type columns, and I would like to output two things:
- The intersection of both lists (as a list)
- The number of elements in the intersection
My input data looks something like the following (using the mtcars dataset) where "sc" is my spark connection:
library(dplyr)
library(sparklyr)
## Load mtcars into spark with connection "sc"
mtcars_spark <- copy_to(sc, mtcars)
## Wrangle mtcars to get list columns using ft_regex_tokenizer()
tbl_with_lists <- mtcars_spark %>%
mutate(mpg_rounded = round(mpg, -1)) %>%
group_by(mpg_rounded) %>%
summarize(
cyl_all = paste(collect_set(as.character(cyl)), sep = ", "),
gear_all = paste(collect_set(as.character(gear)), sep = ", ")
) %>%
ungroup() %>%
ft_regex_tokenizer("cyl_all", "cyl_list", pattern = "[,]\\s*") %>%
ft_regex_tokenizer("gear_all", "gear_list", pattern = "[,]\\s*")
tbl_with_lists
## # Source: spark<?> [?? x 5]
## mpg_rounded cyl_all gear_all cyl_list gear_list
## <dbl> <chr> <chr> <list> <list>
## 1 10 8.0 3.0 <list [1]> <list [1]>
## 2 30 4.0 5.0, 4.0 <list [1]> <list [2]>
## 3 20 8.0, 6.0, 4.0 5.0, 3.0, 4.0 <list [3]> <list [3]>
I haven't had much success with finding out how to do this. Any ideas?