r - separate digits from string

Question

What is the most effective way to separate the digits from the letters in this example?

       V1 V2
1 p_men_1  1
2 p_men_2  0
3 p_men_3  1
4 p_wom_1  1
5 p_wom_2  1
6 p_wom_3  0

Desired output:

     V1 V2 V3
1 p_men  1  1
2 p_men  2  0
3 p_men  3  1
4 p_wom  1  1
5 p_wom  2  1
6 p_wom  3  0

I tried

library(tidyr) 
library(dplyr)

df %>% separate(V1, c('V1', 'V2'), sep = '_')

but it doesn't work, because the part before the digits also contains a '_'.

  df = rbind(c('p_men_1', 1), 
  c('p_men_2', 0), 
  c('p_men_3', 1), 
  c('p_wom_1', 1), 
  c('p_wom_2', 1), 
  c('p_wom_3', 0))

  df = as.data.frame(df)
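To make the problem concrete, here is a small base R check (a sketch; `x` and `pieces` are illustrative names, not part of the data above). Splitting on every underscore yields three pieces per value, while only two columns are wanted:

```r
# Two sample values taken from the V1 column above
x <- c("p_men_1", "p_wom_3")

# Split on every literal underscore
pieces <- strsplit(x, "_", fixed = TRUE)

lengths(pieces)  # three pieces per value, not two
pieces[[1]]      # "p" "men" "1"
```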

http://stackoverflow.com/questions/4350440/split-a-column-of-a-data-frame-to-multiple-columns `cbind(read.table(text = gsub('_(?=\\d+)', ' ', df$V1, perl = TRUE)), V3 = df[, 2])` — rawr, Dec 11 '16 at 16:48
[Separating column using separate (tidyr) via dplyr on a first encountered digit](http://stackoverflow.com/questions/34842528/separating-column-using-separate-tidyr-via-dplyr-on-a-first-encountered-digit). Modify the `sep` parameter a little and you should accomplish your results. — timtrice, Dec 11 '16 at 16:51

score 6 · Accepted Answer · answered Dec 11 '16 at 16:55

This could work:

df %>% 
    extract(V1, c('V1', 'V2'), regex = '(^.+)_(\\d+)')

#      V1 V2 V2
# 1 p_men  1  1
# 2 p_men  2  0
# 3 p_men  3  1
# 4 p_wom  1  1
# 5 p_wom  2  1
# 6 p_wom  3  0
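Note that the output above ends up with two columns named V2. If the V1/V2/V3 naming from the question is needed, one possible base R variant is sketched below. It is not the answer's method: it swaps in `sub()` with the same "last underscore followed by digits" idea, and the sample `df` and the name `result` are assumptions made for illustration.

```r
# Sample data, re-created from the question (V2 kept numeric here)
df <- data.frame(
  V1 = c("p_men_1", "p_men_2", "p_men_3", "p_wom_1", "p_wom_2", "p_wom_3"),
  V2 = c(1, 0, 1, 1, 1, 0),
  stringsAsFactors = FALSE
)

result <- data.frame(
  # Drop the trailing "_<digits>" to keep the prefix
  V1 = sub("_[0-9]+$", "", df$V1),
  # Keep only the trailing digits and convert them to integer
  V2 = as.integer(sub("^.*_([0-9]+)$", "\\1", df$V1)),
  # Carry the original second column over under a new name
  V3 = df$V2,
  stringsAsFactors = FALSE
)

result
```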

Tyler Rinker

The `tidyr::extract` function appears a lot more intuitive than `strsplit`. Has the added advantage of having a factor.method. – IRTFM Dec 11 '16 at 17:25

IRTFM · Answer 2 · 2016-12-11T17:28:00.383

My strategy was to split on the last underscore. The pattern matches an underscore followed by a zero-length look-ahead that requires only non-underscore characters up to the end of the string.

cbind( do.call( rbind, strsplit(as.character(dat$V1), split= '_(?=[^_]+$)', perl=TRUE) ),
       dat['V2'] )
      1 2 V2
1 p_men 1  1
2 p_men 2  0
3 p_men 3  1
4 p_wom 1  1
5 p_wom 2  1
6 p_wom 3  0

Unfortunately, this appears to be a malformed data frame: although the result is recognized as a data frame and cbind.data.frame gets called, the column names are left improperly formed with leading digits.
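One way to avoid the digit-only column names, as a sketch: keep the same split on the last underscore, but assemble the result with explicit names. The sample `dat` is re-created from the question's data, and `parts` and `res` are illustrative names.

```r
# Sample data, re-created from the question
dat <- data.frame(
  V1 = c("p_men_1", "p_men_2", "p_men_3", "p_wom_1", "p_wom_2", "p_wom_3"),
  V2 = c(1, 0, 1, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Split each value on its last underscore; rbind gives a 6 x 2 character matrix
parts <- do.call(
  rbind,
  strsplit(as.character(dat$V1), split = "_(?=[^_]+$)", perl = TRUE)
)

# Name every column explicitly instead of relying on cbind's defaults
res <- data.frame(
  V1 = parts[, 1],
  V2 = as.integer(parts[, 2]),
  V3 = dat$V2,
  stringsAsFactors = FALSE
)

res
```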
