SparklyR separate one Spark DataFrame column into two columns

Question

I have a dataframe containing a column named COL which is structured in this way:

VALUE1###VALUE2

The following code works, but only after collecting the data to the driver:

library(sparklyr)
library(tidyr)
library(dplyr)
mParams<- collect(filter(input_DF, TYPE == ('MIN')))
mParams<- separate(mParams, COL, c('col1','col2'), '\\###', remove=FALSE)

If I remove the collect, I get this error:

Error in UseMethod("separate_") : 
  no applicable method for 'separate_' applied to an object of class "c('tbl_spark', 'tbl_sql', 'tbl_lazy', 'tbl')"

Is there any alternative to achieve what I want, but without collecting everything on my spark driver?

dalloliogm · Answer 1 · 2019-09-06T14:52:19.897

You can use ft_regex_tokenizer followed by sdf_separate_column.

ft_regex_tokenizer will split a column into a vector type, based on a regex. sdf_separate_column will split this into multiple columns.

mydf %>% 
    ft_regex_tokenizer(input_col="mycolumn", output_col="mycolumnSplit", pattern=";") %>% 
    sdf_separate_column("mycolumnSplit", into=c("column1", "column2"))

UPDATE: in recent versions of sparklyr, the parameters input.col and output.col have been renamed to input_col and output_col, respectively.
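
Applied to the data in the question, the same two steps might look like the sketch below. It assumes a local Spark installation; the small demo table and the intermediate column name COL_split are made up for illustration and stand in for the real input_DF. Note that ft_regex_tokenizer lowercases tokens by default, so to_lower_case = FALSE is set here to keep VALUE1 and VALUE2 as they are.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available for this demo.
sc <- spark_connect(master = "local")

# Hypothetical stand-in for input_DF from the question.
input_DF <- copy_to(
  sc,
  data.frame(TYPE = c("MIN", "MAX"),
             COL  = c("VALUE1###VALUE2", "VALUE3###VALUE4"),
             stringsAsFactors = FALSE),
  name = "input_df", overwrite = TRUE
)

mParams <- input_DF %>%
  filter(TYPE == "MIN") %>%
  # Split COL on the literal separator into an array column.
  # to_lower_case defaults to TRUE, which would turn VALUE1 into value1.
  ft_regex_tokenizer(input_col = "COL", output_col = "COL_split",
                     pattern = "###", to_lower_case = FALSE) %>%
  # Spread the array elements into two regular columns.
  sdf_separate_column("COL_split", into = c("col1", "col2")) %>%
  # Drop the intermediate array column (COL itself is kept).
  select(-COL_split)

mParams  # still a tbl_spark; nothing has been collected to the driver
```

The original COL column is kept, which mirrors remove=FALSE in the tidyr call. Call spark_disconnect(sc) when finished.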

score 2 · Answer 2 · edited Feb 17 '21 at 08:47


Sparklyr version 0.5 has just been released, and it contains the ft_regex_tokenizer() function that can do that:

A regex based tokenizer that extracts tokens either by using the provided regex pattern to split the text (default) or repeatedly matching the regex (if gaps is false).

library(dplyr)
library(sparklyr)
ft_regex_tokenizer(input_DF, input_col = "COL", output_col = "ResultCols", pattern = '\\###')

The split column "ResultCols" will be a list column.

edited Feb 17 '21 at 08:47 by dalloliogm · answered Jan 25 '17 at 08:28 by Jaime Caffarel

I know `ft_regex_tokenizer`, but the question was to separate the values and store it in 2 columns and not in 1 list column. `tidyr::unnest` is just working locally after `collect`, which is not appropriate in my case, because I have to aggregate the data using 1 output column ... – nachti Jan 31 '18 at 14:00

Note: in recent versions of sparklyr, the parameters input.col and output.col have been renamed to input_col and output_col, respectively. – dalloliogm Sep 06 '19 at 14:52
I would suggest editing the answer to show input_col and output_col now that this is the current implementation. – Joey Oct 12 '19 at 11:48
