extract several variables from string position using a dictionary (varname,startposition,endposition) in data.table

Question

I have a data.table, DT containing one string variable, s, from which I want to create additional variables based on the positions of each character in s. The varname, start and end positions and vartype of each variable are given in a dictionary, dic. Here is an example, including the desired output:

DT <- DT <- data.table(s=c('a191','b292','c393'))
dic <- data.table(varname=c('bla,ble,bli'),start=c(1,2,3),end=c(1,2,4),vartype=c('c','i','i')
DTdesired <- data.table(bla=c('a','b','c'),ble=c(1,2,3),bli=c(91,92,93))

This is the same problem as importing data from a fixed width file (fwf), which has been covered in this SO question. The answer to that post includes a data table solution

 DT[ , lapply(seq_len(length(cols$beg)), function(ii) {
      stri_sub(V1, cols$beg[ii], cols$end[ii])
    })]

But I am looking for a solution that besides start and end would also use the varname and coltype variables of the dictionary

Edit: In reality, DT would be a large Census file, around 200 million observations. Each element of s is a string with 200 characters. And the dictionary, dic, can contain about 10 to 50 variables to be extracted, depending on the year.

Try `type.convert(dic[, setNames(Map(function(x, y) substr(DT$s, x, y), star, end), strsplit(varname[1], ",")[[1]])][, s := DT$s][], as.is = TRUE)` — akrun, Aug 04 '22 at 20:53
@akrun, as always, tks! This looks like black magic to me. But it works. In case you want to submit this as an answer, for clarity I fixed a typo in the original post (start not star), and removed column s from the output file. Do you expect that this be memory efficient and fast? I have a dataset with 200mm obs, extracting 12 columns. The server I used has been stuck executing read_fwf for 24hs.... — LucasMation, Aug 04 '22 at 21:08
You meant `200 million` observations?. This code is actually looped by row of the 'dic' data. So may not be that efficient — akrun, Aug 04 '22 at 21:16
The dictionary has only 15 rows, corresponding to the 15 variables to be extracted. But DT has 200 million observations. — LucasMation, Aug 04 '22 at 21:20
it could also increase efficiency if `type.convert()` could run within each iteration of the loop. Optimize each column at the time, instead of at the end — LucasMation, Aug 04 '22 at 21:23
@akrun, this works quite well! I edited a bit. `subtr_typeconvert <- function(x, y) DT$s %>% substr(x, y) %>% type.convert(as.is = TRUE)`. Then loop: `DTdesired <- dic[, setNames(Map(function(x, y) subtr_typeconvert(x,y), start, end), varname)]`. Please add as answer so I can solve this issue. — LucasMation, Aug 05 '22 at 19:10

score 1 · Accepted Answer · answered Aug 06 '22 at 16:50

We can use Map to do the conversion. (added the updated function created by @LucasMation)

library(magrittr)
library(data.table)
subtr_typeconvert <- function(x, y)
     {
 DT$s %>%
   substr(x, y) %>%
     type.convert(as.is = TRUE)
}
dic[, setNames(Map(function(x, y)
      subtr_typeconvert(x,y), start, end), varname)]

extract several variables from string position using a dictionary (varname,startposition,endposition) in data.table

1 Answers1

Linked