I have a dataframe such as

COl1
scaffold_97606_2-BACs_-__SP1_1
UELV01165908.1_2-BACs_+__SP2_2
UXGC01046554.1_9-702_+__SP3_3
scaffold_12002_1087-1579_-__SP4_4

and I would like to split each value into two columns to get:

COL1           COL2 
scaffold_97606 2-BACs_-__SP1_1
UELV01165908.1 2-BACs_+__SP2_2
UXGC01046554.1 9-702_+__SP3_3
scaffold_12002 1087-1579_-__SP4_4

As you can see, the separator varies: it can be .Number_ or Number_Number.

So far I wrote:

df2 <- df1 %>%
    separate(COl1, paste0('col', 1:2), sep = " the separator patterns ", extra = "merge")

but I do not know which separator pattern to use in place of " the separator patterns ".

zx8754
chippycentra
  • Just for the record: OP attempt is described [here](https://stackoverflow.com/q/62913589/3832970). It is not a mere gimme-teh-codez request. – Wiktor Stribiżew Jul 17 '20 at 05:06

2 Answers

You may use

> df1 %>%
    separate(COl1, paste0('col', 1:2), sep = "(?<=\\d)_(?=\\d+-)", extra = "merge")
            col1               col2
1 scaffold_97606    2-BACs_-__SP1_1
2 UELV01165908.1    2-BACs_+__SP2_2
3 UXGC01046554.1     9-702_+__SP3_3
4 scaffold_12002 1087-1579_-__SP4_4

See the regex demo

Pattern details

  • (?<=\d) - a positive lookbehind that requires a digit immediately to the left of the current location
  • _ - an underscore
  • (?=\d+-) - a positive lookahead that requires one or more digits and then a - immediately to the right of the current location.
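
To check the pattern outside of tidyr, here is a minimal base R sketch, assuming the four sample strings from the question. It locates the separator with regexpr() (perl = TRUE is needed for the lookarounds) and cuts each string around it. The names x, pattern, pos, left and right are only illustrative and not part of the answer.

```r
# Sample values copied from the question
x <- c("scaffold_97606_2-BACs_-__SP1_1",
       "UELV01165908.1_2-BACs_+__SP2_2",
       "UXGC01046554.1_9-702_+__SP3_3",
       "scaffold_12002_1087-1579_-__SP4_4")

# Same pattern as in separate(); lookarounds require perl = TRUE in base R
pattern <- "(?<=\\d)_(?=\\d+-)"

# Position of the first matching underscore in each string (-1 if none)
pos <- as.integer(regexpr(pattern, x, perl = TRUE))

# Everything before the matched underscore, and everything after it
left  <- substr(x, 1, pos - 1)
right <- substr(x, pos + 1, nchar(x))

data.frame(col1 = left, col2 = right)
```

Only the underscore itself is consumed by the match, since both lookarounds are zero-width; that is why the digits on either side end up intact in the two columns.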
Wiktor Stribiżew

You can use extract from tidyr:

tidyr::extract(df, COl1, c('Col1', 'Col2'), regex = '(.*?\\d+)_(.*)')

#            Col1               Col2
#1 scaffold_97606    2-BACs_-__SP1_1
#2 UELV01165908.1    2-BACs_+__SP2_2
#3 UXGC01046554.1     9-702_+__SP3_3
#4 scaffold_12002 1087-1579_-__SP4_4
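
The same capture-group idea can be sketched in base R with sub(), assuming the sample strings from this answer's data. The ^ and $ anchors are my addition so that the whole string is replaced by a single group; the names x, regex, first and second are illustrative only.

```r
# Sample values, same as in the data block of this answer
x <- c("scaffold_97606_2-BACs_-__SP1_1",
       "UELV01165908.1_2-BACs_+__SP2_2",
       "UXGC01046554.1_9-702_+__SP3_3",
       "scaffold_12002_1087-1579_-__SP4_4")

# Group 1: shortest prefix ending in digits that is followed by "_"
# Group 2: everything after that underscore
regex <- "^(.*?\\d+)_(.*)$"

# Replace the whole string by one captured group at a time
first  <- sub(regex, "\\1", x, perl = TRUE)
second <- sub(regex, "\\2", x, perl = TRUE)

data.frame(Col1 = first, Col2 = second)
```

The lazy .*? makes the first group stop at the earliest run of digits that is followed by an underscore. On these four strings that is the same split point the lookaround pattern in the other answer finds, but the two patterns could differ on other inputs.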

data

df <- structure(list(COl1 = c("scaffold_97606_2-BACs_-__SP1_1", 
"UELV01165908.1_2-BACs_+__SP2_2", 
"UXGC01046554.1_9-702_+__SP3_3", "scaffold_12002_1087-1579_-__SP4_4"
)), class = "data.frame", row.names = c(NA, -4L))
Ronak Shah