I am looking for a dplyr way to split a string variable into multiple columns according to a dictionary:
library(dplyr)
vardic <- data.frame(varname=c('a','b','c','d'),
                     length=c(2,6,3,1) ) %>%
  mutate(end=cumsum(length),start=end-length+1)
d <- data.frame(orig_string=c('11333333444A',
'22444444111C',
'55666666000B'))
The desired output is:
d2 <- data.frame(a=c(11,22,55),b=c(333333,444444,666666),c=c(444,111,000),d=c('A','C','B'))
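To make the intended mapping concrete, here is a minimal in-memory sketch of how the start/end positions in vardic could drive the split, using plain dplyr with base substr(). The names slices and d_split are made up for this illustration, every resulting column comes back as character (so c keeps its leading zeros as '000'), and whether each call is translated by arrow would still need to be checked against its list of supported functions.

```r
library(dplyr)

# Dictionary: one row per target column, with its start/end position
vardic <- data.frame(varname = c("a", "b", "c", "d"),
                     length  = c(2, 6, 3, 1)) %>%
  mutate(end = cumsum(length), start = end - length + 1)

d <- data.frame(orig_string = c("11333333444A",
                                "22444444111C",
                                "55666666000B"))

# Build one unevaluated substr() call per dictionary row (illustrative helper list)
slices <- lapply(seq_len(nrow(vardic)), function(i) {
  bquote(substr(orig_string, .(vardic$start[i]), .(vardic$end[i])))
})
names(slices) <- as.character(vardic$varname)

# Splice the calls into mutate(), then keep only the new columns
d_split <- d %>%
  mutate(!!!slices) %>%
  select(all_of(names(slices)))
```

Because the substr() calls are generated from vardic, a change to the dictionary does not require touching the pipeline itself.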
This has to be done using only dplyr commands, because it will be implemented via arrow on a larger-than-memory dataset (asked in this other question).
UPDATE (responding to comments): functions outside dplyr could be used, as long as they are supported by arrow. arrow's list of supported R/dplyr functions describes what has been implemented so far. Hopefully this pseudocode illustrates the pipeline:
library(tidyverse)
library(arrow)
d %>% write_dataset('myfile',format='parquet')
'myfile' %>% open_dataset %>%
sequence_of_arrowsupported_commands_to_split_columns
Update2: added columns indicating the start and end position in vardic.
Update3: made the arrow pipeline above more reproducible, then tested @akrun's solution. However, separate is not supported by arrow.