I need help figuring out how to split strings in a column of a data frame based on the last delimiter when I have varying numbers of the same delimiter in R. For example,

col1 <- c('a', 'b', 'c')
col2 <- c('a_b', 'a_b_c', 'a_b_c_d')
df <- data.frame(cbind(col1, col2))

And I would like to split df$col2 to have a data frame that looks like:

col1 <- c('a', 'b', 'c')
col2 <- c('a', 'a_b', 'a_b_c')
col3 <- c('b', 'c', 'd')
Ronak Shah
user42485
  • As an aside, don't ever do `data.frame(cbind(...` unless you want your life to be difficult. That creates a matrix first, then a data.frame and changes everything to one type (numbers to characters for instance). Just `data.frame(...` will do. – thelatemail Dec 07 '16 at 22:46
  • Thank you @thelatemail. I'm obviously learning, so every piece of advice helps. – user42485 Dec 07 '16 at 22:48
  • Possible duplicate questions too - http://stackoverflow.com/questions/24938616/string-split-on-last-comma-in-r and http://stackoverflow.com/questions/31774086/extracting-text-after-last-period-in-string-in-r – thelatemail Dec 07 '16 at 22:49
  • None of those really have a good answer for this question. – G. Grothendieck Dec 07 '16 at 23:52
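
The type coercion mentioned in the first comment can be seen with a small sketch. The vectors `num` and `chr` are hypothetical and not part of the question:

```r
# Hypothetical columns, one numeric and one character
num <- c(1, 2, 3)
chr <- c("a", "b", "c")

# cbind() builds a character matrix first, so num is no longer numeric
via_cbind <- data.frame(cbind(num, chr))

# data.frame() alone keeps each column's own type
direct <- data.frame(num, chr)

is.numeric(via_cbind$num)  # FALSE
is.numeric(direct$num)     # TRUE
```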

3 Answers


Using the stringi package, you can also achieve your goal. stri_extract_last_regex() extracts the last match of the pattern you specify. Here, I said "get the last lowercase letter in the string." Likewise, you can use stri_replace_last_regex() to modify col2. Here I said "replace the last occurrence of _ followed by a lowercase letter with nothing," that is, "remove the last _ and the lowercase letter after it." Note that col3 is created first: mutate() evaluates its arguments in order, so col3 has to be extracted before col2 is overwritten.

library(dplyr)
library(stringi)

df %>%
  mutate(col3 = stri_extract_last_regex(str = col2, pattern = "[a-z]"),
         col2 = stri_replace_last_regex(str = col2, pattern = "_[a-z]", replacement = ""))

#  col1  col2 col3
#1    a     a    b
#2    b   a_b    c
#3    c a_b_c    d
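
The patterns above assume every piece is a single lowercase letter. If the pieces can be longer, a possible variation is to anchor on the underscore instead. This is only a sketch: the input vector `x` and the patterns `[^_]+$` and `_[^_]*$` are my own assumptions, and it still assumes each string contains at least one underscore.

```r
library(stringi)

# Hypothetical input with multi-character pieces
x <- c("alpha_beta", "a1_b2_c3")

# Everything after the last underscore
last_piece <- stri_extract_last_regex(x, "[^_]+$")

# Drop the last underscore and whatever follows it
rest <- stri_replace_last_regex(x, "_[^_]*$", "")
```

For a string with no underscore, `[^_]+$` would return the whole string, so that case would need separate handling.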
jazzurro

These use no packages. They assume that each element of col2 has at least one underscore. (See the note at the end if you need to lift this restriction.)

1) The first regular expression, (.*)_.*, matches everything up to the last underscore, then the underscore itself, then everything remaining; the first sub replaces the entire match with the part captured in the parentheses. This works because matching is greedy: the first .* takes everything it can, leaving only what follows the last underscore for the second .* . The second regular expression, .*_, matches everything up to and including the last underscore, and the second sub replaces that with the empty string. Both sub calls see the original col2, because transform evaluates all of its arguments against the original data frame.

transform(df, col2 = sub("(.*)_.*", "\\1", col2), col3 = sub(".*_", "", col2))

2) Here is a variation that is a bit more symmetric. It uses the same regular expression for both sub calls.

pat <- "(.*)_(.*)"
transform(df, col2 = sub(pat, "\\1", col2), col3 = sub(pat, "\\2", col2))
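
To see the greedy matching at work, here is a small sketch on a single string. The lazy `.*?` variant is not part of the answer; it is included only for contrast, and `perl = TRUE` is used for it as an assumption to get PCRE semantics.

```r
x <- "a_b_c_d"

# Greedy (the default): the first .* runs up to the last underscore
greedy_left  <- sub("(.*)_.*", "\\1", x)   # "a_b_c"
greedy_right <- sub(".*_", "", x)          # "d"

# Lazy .*? (for contrast only): stops at the first underscore
lazy_left <- sub("(.*?)_.*", "\\1", x, perl = TRUE)   # "a"
```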

Note: If we did want to handle strings with no underscore at all, such that "xyz" is split into "xyz" and "", then use this for the second sub. It tries to match the left-hand side of the | first, and if that fails (which happens when there are no underscores) the entire string matches the right-hand side, and sub replaces it with the empty string.

sub(".*_|^[^_]*$", "", col2)
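
As a quick check of the note, the sketch below runs both substitutions on a hypothetical vector `x` in which one element has no underscore:

```r
# Hypothetical input; "xyz" contains no underscore
x <- c("a_b_c", "xyz")

# First sub: no match for "xyz", so it is returned unchanged
left <- sub("(.*)_.*", "\\1", x)

# Second sub with the alternation: "xyz" matches ^[^_]*$ and becomes ""
right <- sub(".*_|^[^_]*$", "", x)

data.frame(left, right)
```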
G. Grothendieck
  • Thank you @GGrothendieck, that works perfectly! [Though, it will take me a while to figure out what it all means.] – user42485 Dec 07 '16 at 23:05

A strsplit solution:

spl <- strsplit(as.character(df$col2), "_")

sapply(lapply(spl, head, -1), paste, collapse="_")
#[1] "a"     "a_b"   "a_b_c"
sapply(lapply(spl, tail, 1), paste, collapse="_")
#[1] "b" "c" "d"
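
If this is needed in more than one place, the two steps could be wrapped in a small helper. This is only a sketch: the name `split_last` and its `sep` argument are hypothetical, and `fixed = TRUE` is my choice so that the separator is treated literally rather than as a regular expression.

```r
# Hypothetical helper: split each string at its last separator
split_last <- function(x, sep = "_") {
  parts <- strsplit(as.character(x), sep, fixed = TRUE)
  data.frame(
    # all pieces except the last, glued back together
    left = vapply(parts, function(p) paste(head(p, -1), collapse = sep),
                  character(1)),
    # the last piece only
    right = vapply(parts, function(p) paste(tail(p, 1), collapse = sep),
                   character(1)),
    stringsAsFactors = FALSE
  )
}

res <- split_last(c("a_b", "a_b_c", "a_b_c_d"))
```

With this approach a string without any separator ends up entirely in `right`, with `left` empty, which is the opposite of the regex variant that keeps such a string on the left.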

Or go full functional crazy:

Map(
  function(spl,ty,n) sapply(spl, function(x) paste(ty(x,n),collapse="_") ),
  list(strsplit(as.character(df$col2), "_")),
  c(head,tail),
  c(-1,1) 
)
#[[1]]
#[1] "a"     "a_b"   "a_b_c"
#
#[[2]]
#[1] "b" "c" "d"
thelatemail