Splitting a single column into multiple observation using R

Question

I am working on HCUP data and this has range of values in one single column that needs to be split into multiple columns. Below is the HCUP data frame for reference :

code            label
61000-61003     excision of CNS
0169T-0169T     ventricular shunt

The desired output should be :

code            label
61000           excision of CNS
61001           excision of CNS
61002           excision of CNS
61003           excision of CNS
0169T           ventricular shunt

My approach to this problem is using the package splitstackshape and using this code

library(data.table)
library(splitstackshape)

cSplit(hcup, "code", "-")[, list(code = code_1:code_2, by = label)]

This approach leads to memory issues. Is there a better approach to this problem?

Some comments :

The data has many letters apart from "T".
The letter can be either in the front or at the very end but not in between two numbers.
There is no change of letter from "T" to "U" in one single range

Hmmm I'm not very experienced with data.table, but I can't see how your apprach could work - `Code_1` (shouldn't it be `code_1`?) and `code_2` must be numeric, if you want to build a sequence, e.g. `hcup <- read.table(header=T, stringsAsFactors = F, text="code label\n61000-61003 excision_of_CNS\n0169T-0169T ventricular_shunt"); cSplit(hcup, "code", "-")[, list(Code = seq(as.integer(gsub("\\D", "", code_1)), as.integer(gsub("\\D", "", code_2)))), by = label]`. — lukeA, Oct 13 '15 at 22:30
Thanks. I have accepted the edits. I am not particular about "splitstackshape" as such. Is there a possibility to write a function which can handle this problem? — x1carbon, Oct 13 '15 at 22:36
This might be of help from the `splitstackshape` documentation: If you know that all values in the column would have the same number of values per row after being split, you should use the `cSplit_f` function instead, which uses `fread` instead of `strsplit` and is generally faster. — Cotton.Rockwood, Oct 13 '15 at 23:26
So maybe you can give us a little more info. It the letter `T` always the letter? Is it always at the end of the string? — Rich Scriven, Oct 14 '15 at 15:49
First - The data has many letters apart from "T". Second - The letter can be either in the front or at the very end but not in between two numbers. Third - There is no change of letter from "T" to "U" in one single range. — x1carbon, Oct 14 '15 at 16:16
One can access the original data here - [link](http://www.hcup-us.ahrq.gov/toolssoftware/ccs_svcsproc/ccssvcproc.jsp) — x1carbon, Oct 14 '15 at 16:23
To second-guess the question, I think expanding the data frame might not be the thing that you want to do in the end. Splitting the code column into `begin` and `end`, and storing the `code.prefix` and `code.suffix` seems like it would make matching a lot simpler, which is presumably the use case that this is aimed at. — user295691, Oct 14 '15 at 18:02

jeremycg · Accepted Answer · 2015-10-14T18:25:30.603

Here's a solution using dplyr and all.is.numeric from Hmisc:

library(dplyr)
library(Hmisc)
library(tidyr)
dat %>% separate(code, into=c("code1", "code2")) %>%
        rowwise %>%
        mutate(lists = ifelse(all.is.numeric(c(code1, code2)),
                         list(as.character(seq(from = as.numeric(code1), to = as.numeric(code2)))),
                         list(code1))) %>%
        unnest(lists) %>%
        select(code = lists, label)

Source: local data frame [5 x 2]

   code             label
  (chr)            (fctr)
1 61000   excision of CNS
2 61001   excision of CNS
3 61002   excision of CNS
4 61003   excision of CNS
5 0169T ventricular shunt

An edit to fix ranges with character values. Brings down the simplicity a little:

dff %>% mutate(row = row_number()) %>%
        separate(code, into=c("code1", "code2")) %>%
        group_by(row) %>%
        summarise(lists = if(all.is.numeric(c(code1, code2)))
                              {list(str_pad(as.character(
                                   seq(from = as.numeric(code1), to = as.numeric(code2))),
                                         nchar(code1), pad="0"))}
                          else if(grepl("^[0-9]", code1))
                              {list(str_pad(paste0(as.character(
                                   seq(from = extract_numeric(code1), to = extract_numeric(code2))),
                                      strsplit(code1, "[0-9]+")[[1]][2]),
                                         nchar(code1), pad = "0"))}
                          else
                              {list(paste0(
                                      strsplit(code1, "[0-9]+")[[1]],
                                      str_pad(as.character(
                                    seq(from = extract_numeric(code1), to = extract_numeric(code2))),
                                         nchar(gsub("[^0-9]", "", code1)), pad="0")))},
                   label = first(label)) %>%
        unnest(lists) %>%
        select(-row)
Source: local data frame [15 x 2]

               label lists
               (chr) (chr)
1    excision of CNS 61000
2    excision of CNS 61001
3    excision of CNS 61002
4  ventricular shunt 0169T
5  ventricular shunt 0170T
6  ventricular shunt 0171T
7    excision of CNS 01000
8    excision of CNS 01001
9    excision of CNS 01002
10    some procedure A2543
11    some procedure A2544
12    some procedure A2545
13    some procedure A0543
14    some procedure A0544
15    some procedure A0545

data:

dff <- structure(list(code = c("61000-61002", "0169T-0171T", "01000-01002", 
"A2543-A2545", "A0543-A0545"), label = c("excision of CNS", "ventricular shunt", 
"excision of CNS", "some procedure", "some procedure")), .Names = c("code", 
"label"), row.names = c(NA, 5L), class = "data.frame")

This looks good. But it omits codes like "0169T" in the final output. — x1carbon, Oct 14 '15 at 02:48
This solution is very close but still misses those code where the letter comes first. For example, code "A4245" does not get added to the final database. — x1carbon, Oct 14 '15 at 16:19

Rich Scriven · Answer 2 · 2015-10-14T18:43:19.803

Original Answer: See below for update.

First, I made your example data a little more challenging by adding the first row to the bottom.

dff <- structure(list(code = c("61000-61003", "0169T-0169T", "61000-61003"
), label = c("excision of CNS", "ventricular shunt", "excision of CNS"
)), .Names = c("code", "label"), row.names = c(NA, 3L), class = "data.frame")

dff
#          code             label
# 1 61000-61003   excision of CNS
# 2 0169T-0169T ventricular shunt
# 3 61000-61003   excision of CNS

We can use the sequence operator : to get the sequences for the code column, wrapping with tryCatch() so we can avoid an error on, and save the values that cannot be sequenced. First we split the values by the dash mark - then run it through lapply().

xx <- lapply(
    strsplit(dff$code, "-", fixed = TRUE), 
    function(x) tryCatch(x[1]:x[2], warning = function(w) x)
)
data.frame(code = unlist(xx), label = rep(dff$label, lengths(xx)))
#     code             label
# 1  61000   excision of CNS
# 2  61001   excision of CNS
# 3  61002   excision of CNS
# 4  61003   excision of CNS
# 5  0169T ventricular shunt
# 6  0169T ventricular shunt
# 7  61000   excision of CNS
# 8  61001   excision of CNS
# 9  61002   excision of CNS
# 10 61003   excision of CNS

We're trying to apply the sequence operator : to each element from strsplit(), and if taking x[1]:x[2] is not possible then this returns just the values for those elements and proceeds with the sequence x[1]:x[2] otherwise. Then we just replicate the values of the label column based on the resulting lengths in xx to get the new label column.

Update: Here is what I've come up with in response to your edit. Replace xx above with

xx <- lapply(strsplit(dff$code, "-", TRUE), function(x) {
    s <- stringi::stri_locate_first_regex(x, "[A-Z]")
    nc <- nchar(x)[1L]
    fmt <- function(n) paste0("%0", n, "d")
    if(!all(is.na(s))) {
        ss <- s[1,1]
        fmt <- fmt(nc-1)
        if(ss == 1L) {
            xx <- substr(x, 2, nc)
            paste0(substr(x, 1, 1), sprintf(fmt, xx[1]:xx[2]))
        } else {
            xx <- substr(x, 1, ss-1)
            paste0(sprintf(fmt, xx[1]:xx[2]), substr(x, nc, nc))
        }
    } else {
        sprintf(fmt(nc), x[1]:x[2])
    }
})

Yep, it's complicated. Now if we take the following data frame df2 as a test case

df2 <- structure(list(code = c("61000-61003", "0169T-0174T", "61000-61003", 
"T0169-T0174"), label = c("excision of CNS", "ventricular shunt", 
"excision of CNS", "ventricular shunt")), .Names = c("code", 
"label"), row.names = c(NA, 4L), class = "data.frame")

and run the xx code from above on it, we can get the following result.

data.frame(code = unlist(xx), label = rep(df2$label, lengths(xx)))
#     code             label
# 1  61000   excision of CNS
# 2  61001   excision of CNS
# 3  61002   excision of CNS
# 4  61003   excision of CNS
# 5  0169T ventricular shunt
# 6  0170T ventricular shunt
# 7  0171T ventricular shunt
# 8  0172T ventricular shunt
# 9  0173T ventricular shunt
# 10 0174T ventricular shunt
# 11 61000   excision of CNS
# 12 61001   excision of CNS
# 13 61002   excision of CNS
# 14 61003   excision of CNS
# 15 T0169 ventricular shunt
# 16 T0170 ventricular shunt
# 17 T0171 ventricular shunt
# 18 T0172 ventricular shunt
# 19 T0173 ventricular shunt
# 20 T0174 ventricular shunt

This works great. But the input data has codes like "'0005T-0006T". In this case only 0005T gets labeled in the final output but code 0006T is missing. — x1carbon, Oct 14 '15 at 02:51
My apologies, the data-set was big and I missed it. Yes, i would like to have both the codes in the final output. — x1carbon, Oct 14 '15 at 02:54
Not sure if your example is possible. I'm guessing each label only shows up once in the raw data. — Frank, Oct 14 '15 at 18:01

Frank · Answer 3 · 2015-10-14T18:52:34.720

Create a sequencing rule for such codes:

seq_code <- function(from,to){

    ext = function(x, part) gsub("([^0-9]?)([0-9]*)([^0-9]?)", paste0("\\",part), x)

    pre = unique(sapply(list(from,to), ext, part = 1 ))
    suf = unique(sapply(list(from,to), ext, part = 3 ))

    if (length(pre) > 1 | length(suf) > 1){
        return("NO!")
    }

    num = do.call(seq, lapply(list(from,to), function(x) as.integer(ext(x, part = 2))))
    len = nchar(from)-nchar(pre)-nchar(suf)

    paste0(pre, sprintf(paste0("%0",len,"d"), num), suf)

}

With @jeremycg's example:

setDT(dff)[,.(
  label = label[1], 
  code  = do.call(seq_code, tstrsplit(code,'-'))
), by=.(row=seq(nrow(dff)))]

which gives

    row             label  code
 1:   1   excision of CNS 61000
 2:   1   excision of CNS 61001
 3:   1   excision of CNS 61002
 4:   2 ventricular shunt 0169T
 5:   2 ventricular shunt 0170T
 6:   2 ventricular shunt 0171T
 7:   3   excision of CNS 01000
 8:   3   excision of CNS 01001
 9:   3   excision of CNS 01002
10:   4    some procedure A2543
11:   4    some procedure A2544
12:   4    some procedure A2545
13:   5    some procedure A0543
14:   5    some procedure A0544
15:   5    some procedure A0545

Data copied from @jeremycg's answer:

dff <- structure(list(code = c("61000-61002", "0169T-0171T", "01000-01002", 
"A2543-A2545", "A0543-A0545"), label = c("excision of CNS", "ventricular shunt", 
"excision of CNS", "some procedure", "some procedure")), .Names = c("code", 
"label"), row.names = c(NA, 5L), class = "data.frame")

eddi · Answer 4 · 2015-10-15T14:12:55.770

If you're patient enough, you'd probably parse the strings into separate pieces instead of the eval/parse trick, alas I'm not, so:

fancy.seq = function(x) eval(parse(text=sub(', \\)', ')', sub('\\(, ', '(',
               sub('.*?([0-9]+)(.*)-(.*?)([1-9][0-9]*).*',
                   'paste0("\\3",
                           formatC(\\1:\\4, width=log10(\\4)+1, format="d", flag="0"),
                           "\\2")',
                   x)))))
# using example from jeremycg's answer
dt[, .(fancy.seq(code), label), by = 1:nrow(dt)]
#    nrow    V1             label
# 1:    1 61000   excision of CNS
# 2:    1 61001   excision of CNS
# 3:    1 61002   excision of CNS
# 4:    2 0169T ventricular shunt
# 5:    2 0170T ventricular shunt
# 6:    2 0171T ventricular shunt
# 7:    3 01000   excision of CNS
# 8:    3 01001   excision of CNS
# 9:    3 01002   excision of CNS
#10:    4 A2543    some procedure
#11:    4 A2544    some procedure
#12:    4 A2545    some procedure
#13:    5 A0543    some procedure
#14:    5 A0544    some procedure
#15:    5 A0545    some procedure

If unclear what the above is doing - just run the sub commands one by one on one of the "code" strings.

score 1 · Answer 5 · answered Oct 14 '15 at 00:57

A less elegant way to do it:

# the data
hcup <- data.frame(code=c("61000-61003", "0169T-0169T"),
                   label=c("excision of CNS", "ventricular shunt"), stringsAsFactors = F)
hcup
>         code             label
>1 61000-61003   excision of CNS
>2 0169T-0169T ventricular shunt

# reshaping
# split the code ranges into separate columns
seq.ends <- cbind(do.call(rbind.data.frame, strsplit(hcup$code, "-")), hcup$label)
# create a list with a data.frame for each original line
new.list <- apply(seq.ends, 1, FUN=function(x){data.frame(code=if(grepl("\\d{5}", x[1])){
                     z<-x[1]:x[2]}else{z<-x[1]}, label=rep(x[3], length(z)),
                     stringsAsFactors = F)})
# collapse the list into a df
new.df <- do.call(rbind, lapply(new.list, data.frame, stringsAsFactors=F))

new.df
>     code             label
>1.1 61000   excision of CNS
>1.2 61001   excision of CNS
>1.3 61002   excision of CNS
>1.4 61003   excision of CNS
>2   0169T ventricular shunt

Splitting a single column into multiple observation using R

5 Answers5

Linked