3

I have two columns in a much larger dataframe that I am having difficult splitting. I have used strsplit in past when I was trying to split using a "space", "," or some other delimiter. The hard part here is I don't want to lose any information AND when I split some parts I will end up with missing information. I would like to end up with four columns in the end. Here's a sample of a couple rows of what I have now.

age-gen  surv-camp
45M      1LC
9F       0
12M      1AC
67M      1LC

Here is what I would like to ultimately get.

age   gen   surv   camp
45    M     1      LC
9     F     0      
12    M     1      AC
67    M     1      LC

I've done quite a lot of hunting around on here and have found a number of responses in Java, C++, html etc., but I haven't found anything that explains how to do this in R and when you have missing data.

I saw this about adding a space between values and then just splitting on the space, but I don't see how this would work 1) with missing data, 2) when I don't have consistent numeric or character values in each row.

Community
  • 1
  • 1
Sam Marshal
  • 85
  • 2
  • 6

1 Answers1

4

We loop through the columns of 'df1' (lapply(df1, ..), create a delimiter after the numeric substring using sub, read the vector as data.frame with read.table, rbind the list of data.frames and change the column names of the output.

res <- do.call(cbind, lapply(df1, function(x)
      read.table(text=sub("(\\d+)", "\\1,", x), 
          header=FALSE, sep=",", stringsAsFactors=FALSE)))
colnames(res) <- scan(text=names(df1), sep=".", what="", quiet = TRUE)
res
#  age gen surv camp
#1  45   M    1   LC
#2   9   F    0     
#3  12   M    1   AC
#4  67   M    1   LC

Or using separate from tidyr

library(tidyr)
library(dplyr)
separate(df1, age.gen, into = c("age", "gen"), "(?<=\\d)(?=[A-Za-z])", convert= TRUE) %>% 
       separate(surv.camp, into = c("surv", "camp"), "(?<=\\d)(?=[A-Za-z])", convert = TRUE)
#  age gen surv camp
#1  45   M    1   LC
#2   9   F    0 <NA>
#3  12   M    1   AC
#4  67   M    1   LC

Or as @Frank mentioned, we can use tstrsplit from data.table

library(data.table)
setDT(df1)[, unlist(lapply(.SD, function(x) 
    tstrsplit(x, "(?<=[0-9])(?=[a-zA-Z])", perl=TRUE, 
                        type.convert=TRUE)), recursive = FALSE)]

EDIT: Added the convert = TRUE in separate to change the type of columns after the split.

data

df1 <- structure(list(age.gen = c("45M", "9F", "12M", "67M"), surv.camp = c("1LC", 
 "0", "1AC", "1LC")), .Names = c("age.gen", "surv.camp"), 
class = "data.frame", row.names = c(NA, -4L))
Community
  • 1
  • 1
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Using `separate` worked out perfectly. Thank you. I haven't been able to adapt the first suggestion and get it to run successful but I will use the logic you wrote out at the beginning and see if I get lucky. – Sam Marshal Sep 10 '16 at 17:23
  • @SamMarshal Perhaps your original dataset have some patterns that is not matching with the one you showed. – akrun Sep 10 '16 at 17:25
  • 1
    Maybe worth showing the data.table way as well, which has the nice `type.convert` feature (not sure if `separate` does): `data.table::tstrsplit(x, "(?<=[0-9])(?=[a-zA-Z])", perl=TRUE, type.convert=TRUE)` – Frank Sep 10 '16 at 17:27
  • 1
    @akrun that was my thought as well, so I'll work though the logic in each step and see if I can find where things are getting stuck. – Sam Marshal Sep 10 '16 at 17:28
  • 1
    @Frank Yes `separate` also have that option, but by default it is `FALSE` – akrun Sep 10 '16 at 17:28