0

I am working with a tbl_spark in sparklyr.

I have a spark Dataframe with two list-type columns, and I would like to output two things:

  1. The intersection of both lists (as a list)
  2. The number of elements in the intersection

My input data looks something like the following (using the mtcars dataset) where "sc" is my spark connection:

library(dplyr)      
library(sparklyr)

## Load mtcars into spark with connection "sc"
mtcars_spark <- copy_to(sc, mtcars)

## Wrangle mtcars to get list columns using ft_regex_tokenizer()
tbl_with_lists <- mtcars_spark %>%
  mutate(mpg_rounded = round(mpg, -1)) %>%
  group_by(mpg_rounded) %>%
    summarize(
      cyl_all = paste(collect_set(as.character(cyl)), sep = ", "),
      gear_all = paste(collect_set(as.character(gear)), sep = ", ")
    ) %>%
  ungroup() %>%
  ft_regex_tokenizer("cyl_all", "cyl_list", pattern = "[,]\\s*") %>%
  ft_regex_tokenizer("gear_all", "gear_list", pattern = "[,]\\s*")

tbl_with_lists

## # Source: spark<?> [?? x 5]
##   mpg_rounded cyl_all       gear_all      cyl_list   gear_list 
##         <dbl> <chr>         <chr>         <list>     <list>    
## 1          10 8.0           3.0           <list [1]> <list [1]>
## 2          30 4.0           5.0, 4.0      <list [1]> <list [2]>
## 3          20 8.0, 6.0, 4.0 5.0, 3.0, 4.0 <list [3]> <list [3]>

I haven't had much success with finding out how to do this. Any ideas?

MRipley
  • 1
  • 2
  • Can you provide `list1` and `list2` as `dput()`? – Matt Jun 06 '22 at 14:48
  • I'm not sure what you mean by dput? It's not a function I've come across before – MRipley Jun 06 '22 at 15:09
  • See [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for how to create a minimally reproducible example. It will help others to help you by providing a minimal amount of code (i.e., the data for `list1` and `list2`). Try running `dput(mtcars)` to see how it works. If `mtcars` was `list1`, you could copy/paste the output to provide in your question. – Matt Jun 06 '22 at 15:16
  • Original post edited to include an explicit reprex with mtcars - is that helpful? I'm wondering if there's a way to mutate the two list-columns (cyl_list and gear_list) to produce 1) a new column with the intersection and 2) a new column with the size of the intersection – MRipley Jun 06 '22 at 15:53

1 Answers1

0

I have found what might be a bit of a workaround using explode().

Would be great if there were a more direct way though? Not sure how well this solution will scale up to larger datasets.

tbl_with_lists %>%
  ## First explode the lists to create new rows for each unique list value
  mutate(
    cyl_explode  = explode(cyl_list)
  ) %>%
  mutate(
    gear_explode = explode(gear_list)
  ) %>%

  ## Summarize to count number of matches - this gives the size of the intersection of the two lists
  group_by(mpg_rounded, cyl_list, gear_list) %>%
  summarize(size_of_intersection = sum(as.integer(cyl_explode == gear_explode)))


## # Source: spark<?> [?? x 4]
## # Groups: mpg_rounded, cyl_list
##   mpg_rounded cyl_list   gear_list  size_of_intersection
##         <dbl> <list>     <list>                    <dbl>
## 1          10 <list [1]> <list [1]>                    0
## 2          30 <list [1]> <list [2]>                    1
## 3          20 <list [3]> <list [3]>                    1
MRipley
  • 1
  • 2