0

I am looking for a dplyr way to break variable into multiple columns according to dictionary:

vardic <- data.frame(varname=c('a','b','c','d'),
                     length=c(2,6,3,1) ) %>% 
               mutate(end=cumsum(length),start=end-length+1)

d <- data.frame(orig_string=c('11333333444A',
                              '22444444111C',
                              '55666666000B'))

The desired output is:

d2 <- data.frame(a=c(11,22,55),b=c(333333,444444,666666),c=c(444,111,000),d=c('A','C','B')

This has to be done using only dplyr commands because this will be implemented via arrow on a larger than memory dataset (asked in this other question)

UPDATE (responding to comments): functions outside dplyr could be used, as long as supported by arrow. arrow's list of R/dplyr supported functions describes what has been implemented so far. Hopefully this pseudocode illustrates the pipeline:

library(tidyverse)
library(arrow)
d %>% write_dataset('myfile',format='parquet')
'myfile' %>% open_dataset %>% 
    sequence_of_arrowsupported_commands_to_split_columns

Update2: added cols indicating start and end position in vardic

Update3: made the arrow pipeline, above, more reproducible. then tested @akrun's solution. But separate is not supported by arrow

LucasMation
  • 2,408
  • 2
  • 22
  • 45
  • Looks like a request to read fixed width files on basis of a set of lengths. – IRTFM Nov 01 '22 at 19:21
  • Can you clarify "only dplyr commands"? What about base R, core packages like utils, or other tidyverse packages like stringr or tidyr? – zephryl Nov 01 '22 at 19:50
  • @IRTFM, that is correct, if you click on the other question referred above, you will see that I am trying to implement an arrow based FWF reader – LucasMation Nov 01 '22 at 20:50
  • It looks like this might be a better list of functions exposed by the Arrow API? [https://arrow.apache.org/docs/r/reference/acero.html](https://arrow.apache.org/docs/r/reference/acero.html) – zephryl Nov 02 '22 at 02:43
  • It sounds like you're handling a fixed-width file probably best read in with something like `readr::read_fwf(path, col_positions = readr::fwf_widths(vardic$length, col_names = vardic$varname))`. You could convert it into a CSV arrow can handle with sed or awk if you want. – alistaire Nov 02 '22 at 20:55
  • @alistaire. Tks. Yes, I am doing that, but I am exploring the case where the fwf do not fit into memory – LucasMation Nov 02 '22 at 21:03
  • 2
    There's a ticket for support for `tidyr::separate()` ([ARROW-8813](https://issues.apache.org/jira/browse/ARROW-8813)), which would offer a reasonably quick way to deal with this – alistaire Nov 03 '22 at 15:11
  • @IRTFM, more broadly, I think votes are secret, and OPoster upvoting or not should not make a difference in the long run. The community of future readers is the one that will evaluate which answer is best, and that is the "game", if any. I think the main point is that my question morphed through the edits into "a solution that works within arrow implemented functions". But I recognize tittle does not match the current state of the question. Anyway, I will erase this and previous comment soon, I hope you can do the same. Vote shenanigans detract from question interest – LucasMation Nov 04 '22 at 02:31
  • I’m not worried about the title. The problem is extraction of data from an R6 “environment”. I suspect the solution to your dplyr requirements will be to create a function that can deliver the text after extraction. – IRTFM Nov 04 '22 at 04:52

3 Answers3

2

Not sure, as others, what exactly you mean by 'only' dplyr. If by that you mean only tidyverse, here's a solution that relies on dplyr, tidyr and stringr:

library(dplyr)
library(stringr)
library(tidyr)
d2 <- d %>%
  mutate(orig_string = str_extract_all(orig_string, "(.)\\1+(?!\\1)|[A-Z]$")) %>%
  unnest_wider(orig_string)
names(d2) <- vardic$varname
# A tibble: 3 × 4
  a     b      c     d    
  <chr> <chr>  <chr> <chr>
1 11    333333 444   A    
2 22    444444 111   C    
3 55    666666 000   B  

EDIT:

Here's a fully automated tidyverse solution:

library(tidyr)
d %>%
  separate(orig_string, 
           into = vardic$varname, 
           sep = cumsum(vardic$lenght))
   a      b   c d
1 11 333333 444 A
2 22 444444 111 C
3 55 666666 000 B
Chris Ruehlemann
  • 20,321
  • 4
  • 12
  • 34
  • @Chis, tks. This is clear and concise. As mentioned in one of the comments to the OP tidyr::separate() is not yet supported by arrow, but there is a feature request for that. So maybe it will be soon. – LucasMation Nov 04 '22 at 18:40
  • @Cris, please delete the comments above – LucasMation Nov 04 '22 at 18:40
2

OP here. Tks for all the support. All other answers are great and work well with purr + dplyr. However, these loop/map through the variables have not been implemented in arrow yet. One solution, however, is to have the loop outside and the arrow command be repeated for eah variable:

For instance (a hard coded sequence would be):

ds <- 'file' %>% open_dataset 
ds <- ds %>% mutate(a=str_sub(orig_string,1,2))
ds <- ds %>% mutate(b=str_sub(orig_string,3,8))
...
ds %>% collect

Now reimplement this as a function + a loop:

extract_var_arrow <- function(ds,var){
  s <- vardic[varname==var]$start
  e <- vardic[varname==var]$end
  ds %>% mutate("{var}" := str_sub(orig_string,s,e)) %>% return
}
for(v in vardic$varname){
  ds <- ds %>% extract_var_arrow(v)
}

Note that, until it sees a collect statement arrow, is just compiling a query. So the above loop is equivalent to:

   # ds <- ds %>% extract_var_arrow('a') 
   #             %>% extract_var_arrow('b')
   #                %>% extract_var_arrow('c')
   #                   %>% extract_var_arrow('d')

Finally we can collect

ds %>% select(-orig_string) %>% collect
   a      b   c d
1 11 333333 444 A
2 22 444444 111 C
3 55 666666 000 B
LucasMation
  • 2,408
  • 2
  • 22
  • 45
1

Base R solution:

# instantiate d2 with nrow(d) rows and 0 columns
d2 <- d
d2$orig_string <- NULL

for (i in seq(to = nrow(vardic))) {
  d2[[vardic$varname[[i]]]] <- substr(
    d$orig_string, 
    vardic$start[[i]], 
    vardic$end[[i]]
  )
}

d2
   a      b   c d
1 11 333333 444 A
2 22 444444 111 C
3 55 666666 000 B

If you can use other tidyverse packages, here's a solution using purrr:pmap_dfc():

library(dplyr)
library(purrr)
library(stringr)

pmap_dfc(vardic, \(varname, start, end, ...) tibble(
  !!varname := str_sub(d$orig_string, start = start, end = end)
))
# A tibble: 3 × 4
  a     b      c     d    
  <chr> <chr>  <chr> <chr>
1 11    333333 444   A    
2 22    444444 111   C    
3 55    666666 000   B    
zephryl
  • 14,633
  • 3
  • 11
  • 30
  • tks. It now works. However the arrow pipeline ( `'file' %>% open_dataset %>% pmap_dfc(...)` ) still does not work: "Error in `as_mapper()`: ! Can't convert `.f`, a object, to a function." – LucasMation Nov 02 '22 at 01:45
  • wrapping your code in a function `myfunc <- function(d) pmap_dfc(..)` works for `myfunc(d)`. But for `'file' %>% open_dataset %>% myfunc` returns a data.frame with correct colnames but now rows – LucasMation Nov 02 '22 at 01:50
  • @LucasMation really not familiar enough with arrow to help, but probably reflects that purrr functions, aren't implemented in the arrow API. – zephryl Nov 02 '22 at 02:44