What is the most effective way to separate the trailing digits from the letters in this example:

       V1 V2
1 p_men_1  1
2 p_men_2  0
3 p_men_3  1
4 p_wom_1  1
5 p_wom_2  1
6 p_wom_3  0

Desired output:

     V1 V2 V3
1 p_men  1  1
2 p_men  2  0
3 p_men  3  1
4 p_wom  1  1
5 p_wom  2  1
6 p_wom  3  0

I tried

library(tidyr) 
library(dplyr)

df %>% separate(V1, c('V1', 'V2'), sep = '_')

but because V1 contains more than one '_', it doesn't work.

  df = rbind(c('p_men_1', 1), 
  c('p_men_2', 0), 
  c('p_men_3', 1), 
  c('p_wom_1', 1), 
  c('p_wom_2', 1), 
  c('p_wom_3', 0))

  df = as.data.frame(df)
giac
  • http://stackoverflow.com/questions/4350440/split-a-column-of-a-data-frame-to-multiple-columns `cbind(read.table(text = gsub('_(?=\\d+)', ' ', df$V1, perl = TRUE)), V3 = df[, 2])` – rawr Dec 11 '16 at 16:48
  • [Separating column using separate (tidyr) via dplyr on a first encountered digit](http://stackoverflow.com/questions/34842528/separating-column-using-separate-tidyr-via-dplyr-on-a-first-encountered-digit). Modify the `sep` parameter a little and you should accomplish your results. – timtrice Dec 11 '16 at 16:51

2 Answers


This could work:

df %>% 
    extract(V1, c('V1', 'V2'), regex = '(^.+)_(\\d+)')

#      V1 V2 V2
# 1 p_men  1  1
# 2 p_men  2  0
# 3 p_men  3  1
# 4 p_wom  1  1
# 5 p_wom  2  1
# 6 p_wom  3  0
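
For readers without tidyr, the same capture-group idea can be sketched in base R with `sub`. This is only an illustration, not the answer's own code: the data frame is rebuilt here with `stringsAsFactors = FALSE`, and the names `pattern` and `result` as well as the `V3` column label are assumptions chosen to match the desired output in the question.

```r
# Sample data mirroring the question (rebuilt here so the block is self-contained)
df <- data.frame(
  V1 = c("p_men_1", "p_men_2", "p_men_3", "p_wom_1", "p_wom_2", "p_wom_3"),
  V2 = c(1, 0, 1, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Greedy (.+) takes everything up to the last underscore; (\\d+) takes the trailing digits
pattern <- "^(.+)_(\\d+)$"

result <- data.frame(
  V1 = sub(pattern, "\\1", df$V1),              # prefix, e.g. "p_men"
  V2 = as.integer(sub(pattern, "\\2", df$V1)),  # trailing number as integer
  V3 = df$V2,                                   # original second column, renamed
  stringsAsFactors = FALSE
)

print(result)
```

Giving the old `V2` a new name (`V3`) avoids ending up with two columns that share the same name.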
Tyler Rinker
  • The `tidyr::extract` function appears a lot more intuitive than `strsplit`. Has the added advantage of having a factor.method. – IRTFM Dec 11 '16 at 17:25

My strategy was to split on the last underscore. This can be coded as a pattern that matches an underscore followed by a zero-length look-ahead requiring only non-underscore characters up to the end of the string.

# df is the data frame from the question; as.character() guards against V1 being a factor
cbind( do.call( rbind, strsplit(as.character(df$V1), split= '_(?=[^_]+$)', perl=TRUE) ),
       df['V2'] )
      1 2 V2
1 p_men 1  1
2 p_men 2  0
3 p_men 3  1
4 p_wom 1  1
5 p_wom 2  1
6 p_wom 3  0

Unfortunately, the result is a somewhat malformed data frame: although it is recognized as a data frame and cbind.data.frame gets called, the first two column names are left improperly formed, as the bare digits 1 and 2.
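
One possible way around the column-name problem is to assemble the data frame explicitly instead of relying on `cbind`. The following is a minimal sketch under a few assumptions: the data are rebuilt with `stringsAsFactors = FALSE`, and the names `parts`, `result`, and the `V3` label are illustrative choices rather than anything from the original answer.

```r
# Sample data mirroring the question (rebuilt here so the block is self-contained)
df <- data.frame(
  V1 = c("p_men_1", "p_men_2", "p_men_3", "p_wom_1", "p_wom_2", "p_wom_3"),
  V2 = c(1, 0, 1, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Split each value on its last underscore only:
# "_" must be followed by non-underscore characters running to the end of the string
parts <- do.call(rbind, strsplit(df$V1, split = "_(?=[^_]+$)", perl = TRUE))

# Name the columns explicitly so none of them ends up as a bare digit
result <- data.frame(
  V1 = parts[, 1],
  V2 = as.integer(parts[, 2]),
  V3 = df$V2,
  stringsAsFactors = FALSE
)

print(result)
```

Because the split pieces come back as a character matrix, the numeric part is converted with `as.integer` before it goes into the result.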

IRTFM