tidyr: separate column while retaining delimiter in the first column

Question

I have a column that I am trying to split into two while retaining the delimiter. I got this far, but part of the delimiter is being dropped. I also need to do this split a second time with the delimiter attached to the first column, which I cannot figure out how to do.

library(tidyr)

duplicates <- data.frame(sample = c("a_1_b1", "a1_2_b1", "a1_c_1_b2"))

duplicates <- separate(duplicates, 
                       sample, 
                       into = c("strain", "sample"),
                       sep = "_(?=[:digit:])")

Using only the first name as an example, my output is a_1 and b1, while my desired output is a_1 and _b1.

I would also like to perform this split with the delimiter added to the first column as below.

sample	batch
a_1_	b1
a1_2_	b1
a1_c_1_	b2

Edit: This post does not answer my question of how to retain the delimiter, or to control which side of the split it ends up on.

Although that dupe didn't answer it, there are hundreds of questions about `separate`, and I am sure this has been answered before. – akrun Jul 28 '21 at 17:59

TarJae · Answer 1 · 2021-07-28T18:13:05.117

score 2

Update, following the OP's request in the comments (the delimiter stays with the second column):

library(dplyr)

duplicates %>% 
    mutate(batch = sub(".*_", "_", sample)) %>%  
    mutate(sample = sub("_[^_]+$", "", sample))

Output:

  sample batch
1    a_1   _b1
2   a1_2   _b1
3 a1_c_1   _b2

Update after clarification in the comments (the delimiter stays with the first column):

duplicates %>% 
    mutate(batch = sub(".*_", "", sample)) %>%  
    mutate(sample = sub("_[^_]+$", "_", sample))

Output:

   sample batch
1    a_1_    b1
2   a1_2_    b1
3 a1_c_1_    b2

First answer: we could use str_sub from the stringr package. This assumes the batch label is always exactly two characters long:

library(stringr)
library(dplyr)

duplicates %>% 
    mutate(batch = str_sub(sample, -2,-1)) %>% 
    mutate(sample = str_sub(sample, end=-3))

Output:

   sample batch
1    a_1_    b1
2   a1_2_    b1
3 a1_c_1_    b2
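
The fixed-width approach above depends on every batch label being exactly two characters, which the comments below show is not always the case. A minimal base R sketch of the difference, assuming a made-up value `a_1_b10` that is not in the original data:

```r
# "a_1_b10" is a hypothetical value added to show a three-character batch label.
x <- c("a_1_b1", "a_1_b10")

# Fixed width: always takes the last two characters, so "b10" is cut to "10".
fixed_batch <- substring(x, nchar(x) - 1)

# Regex: drops everything up to and including the last underscore.
regex_batch <- sub(".*_", "", x)

# Regex: drops the trailing run of non-underscore characters, keeping the delimiter.
regex_sample <- sub("[^_]+$", "", x)

print(data.frame(x, fixed_batch, regex_batch, regex_sample))
```

The regex versions key on the position of the last underscore rather than on a character count, which is why they seem to cope with labels of any length.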

edited Jul 28 '21 at 18:13

answered Jul 28 '21 at 07:55

TarJae

what are the numbers indicating? – keenan Jul 28 '21 at 07:57
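Negative positions count from the end of the string: -1 is the last character, so `str_sub(sample, -2, -1)` takes the last two characters, and `end = -3` keeps everything up to the third character from the end. – TarJae Jul 28 '21 at 07:58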
I should have provided a more accurate representation of the data. I sometimes have b10 values. – keenan Jul 28 '21 at 08:00
Please see my update. Now it should work more generally! – TarJae Jul 28 '21 at 08:21
How would I edit this to receive the opposite switch (a_1 and _b1)? I am still a little iffy on how to use some of the R expressions. – keenan Jul 28 '21 at 17:51
I don't understand. What should be in sample and what in batch? – TarJae Jul 28 '21 at 17:57
I need to perform the split in two different ways for two different analyses. Once I need `a_1_ & b1` and another time I need `a_1 & _b1` – keenan Jul 28 '21 at 17:59

akrun · Answer 2 · 2021-07-28T18:14:30.460

score 2

Using separate

library(tidyr)
separate(duplicates, sample, into = c("strain", "sample"), 
        sep = "(?<=_)(?=[^_]+$)")

-output

    strain sample
1    a_1_     b1
2   a1_2_     b1
3 a1_c_1_     b2

For splitting the other way

separate(duplicates, sample, into = c("strain", "sample"), 
         sep = "(?<=[^_])(?=_[^_]+$)")

-output

  strain sample
1    a_1    _b1
2   a1_2    _b1
3 a1_c_1    _b2
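
Both patterns above consist only of lookarounds, so they match a position between two characters rather than any characters themselves, and nothing is consumed by the split. A rough base R sketch of that idea, where `split_at` is a hypothetical helper written for illustration and not part of tidyr:

```r
x <- c("a_1_b1", "a1_2_b1", "a1_c_1_b2")

# Hypothetical helper: cut each string at the first zero-width match of `pattern`.
split_at <- function(x, pattern) {
  # 1-based index of the character that starts the right-hand piece
  pos <- as.integer(regexpr(pattern, x, perl = TRUE))
  stopifnot(all(pos > 0))  # every string must contain a split point
  data.frame(strain = substr(x, 1, pos - 1),
             sample = substr(x, pos, nchar(x)),
             stringsAsFactors = FALSE)
}

# Split point sits AFTER the last underscore: the delimiter stays on the left.
left <- split_at(x, "(?<=_)(?=[^_]+$)")

# Split point sits BEFORE the last underscore: the delimiter moves to the right.
right <- split_at(x, "(?<=[^_])(?=_[^_]+$)")

print(left)
print(right)
```

In both patterns the lookahead ends in `[^_]+$`, which is what pins the split to the last underscore; the lookbehind only decides which side of that underscore the cut falls on.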

edited Jul 28 '21 at 18:14

answered Jul 28 '21 at 18:01

akrun

score 1 · Accepted Answer · answered Jul 28 '21 at 07:38

You can use tidyr::extract with capture groups.

tidyr::extract(duplicates, sample, c("strain", "sample"), '(.*_)(\\w+)')

#   strain sample
#1    a_1_     b1
#2   a1_2_     b1
#3 a1_c_1_     b2

The same regex can also be used with strcapture in base R:

strcapture('(.*_)(\\w+)', duplicates$sample, 
           proto = list(strain = character(), sample = character()))

answered Jul 28 '21 at 07:38

Ronak Shah

The first solution worked for me. The second created an empty column called sample and assigned the entire string to strain. How would I edit this to achieve the opposite split (a_1 & _b1)? – keenan Jul 28 '21 at 07:48
The second solution seems to work similarly for me for the data that you have shared. – Ronak Shah Jul 28 '21 at 07:50
My actual data is formatted slightly differently, so I am guessing that is why it isn't working. I realized the second column isn't actually empty. With an actual column name rather than the simplified example I gave, it separates as `LEW_3_batch0` and `1` (this is batch 10) or `LEW_2_batch` and `8` – keenan Jul 28 '21 at 07:55
How can I adapt the first solution to give me the opposite split? – keenan Jul 28 '21 at 07:58
Like this? `tidyr::extract(duplicates, sample, c("strain", "sample"), '(.*?)(_.*)')` – Ronak Shah Jul 28 '21 at 08:00
Thank you. Can you explain how this is causing the split to be on the last _ in the string? I don't fully understand what is happening. – keenan Jul 28 '21 at 08:04
`.*` is greedy, meaning it matches as much as possible, so the split lands on the last underscore. To stop at the first underscore instead, make it lazy with `?`, i.e. `(.*?)`, as in the earlier comment. – Ronak Shah Jul 28 '21 at 10:08
In some cases there are 3 underscores and I still need to split on the last one, keeping the underscore attached to the second column, like `a1_c_1 & _b2`. How would I go about this? – keenan Jul 28 '21 at 17:45
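
The greedy versus lazy behaviour described in these comments can be checked in base R. This is only a sketch: `capture_two` is a hypothetical helper, `(.*?)(_.*)` is the pattern from the comment above, and `(.*)(_.*)` is an assumed greedy variant of it that was not posted in the thread.

```r
x <- c("a_1_b1", "a1_2_b1", "a1_c_1_b2")

# Hypothetical helper: return the two capture groups of `pattern` as columns.
capture_two <- function(x, pattern) {
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Element 1 of each match is the full match; 2 and 3 are the capture groups.
  data.frame(strain = vapply(m, `[`, character(1), 2),
             sample = vapply(m, `[`, character(1), 3),
             stringsAsFactors = FALSE)
}

# Lazy first group: stops as early as it can, i.e. at the FIRST underscore.
lazy <- capture_two(x, "(.*?)(_.*)")

# Greedy first group: takes as much as it can, i.e. up to the LAST underscore.
greedy <- capture_two(x, "(.*)(_.*)")

print(lazy)
print(greedy)
```

On this sample data the greedy variant yields `a1_c_1` and `_b2` for the three-underscore case, which appears to be the split asked about in the last comment.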
