I have a column that I am trying to split into two while retaining the delimiter. I got this far, but part of the delimiter is being dropped. I also need to do this split a second time with the delimiter kept on the first column, which I cannot figure out how to do.

library(tidyr)

duplicates <- data.frame(sample = c("a_1_b1", "a1_2_b1", "a1_c_1_b2"))

duplicates <- separate(duplicates, 
                       sample, 
                       into = c("strain", "sample"),
                       sep = "_(?=[:digit:])")

Using only the first name as an example, my output is `a_1` and `b1`, while my desired output is `a_1` and `_b1`.

I would also like to perform this split with the delimiter kept on the first column, as below.

sample   batch
a_1_     b1
a1_2_    b1
a1_c_1_  b2

Edit: This post does not answer my question of how to retain the delimiter, or to control which side of the split it ends up on.

keenan
  • Although that dupe didn't answer it, there are 100s of questions that are asked with `separate` and I am sure that this was answered previously – akrun Jul 28 '21 at 17:59

3 Answers

Update (requested by the OP in the comments), keeping the delimiter on the second column:

library(dplyr)

duplicates %>% 
    mutate(batch = sub(".*_", "_", sample)) %>%  
    mutate(sample = sub("_[^_]+$", "", sample))

Output:

  sample batch
1    a_1   _b1
2   a1_2   _b1
3 a1_c_1   _b2

Update after clarification in the comments, keeping the delimiter on the first column:

duplicates %>% 
    mutate(batch = sub(".*_", "", sample)) %>%  
    mutate(sample = sub("_[^_]+$", "_", sample))

Output:

   sample batch
1    a_1_    b1
2   a1_2_    b1
3 a1_c_1_    b2

First answer: we could use `str_sub` from the stringr package. Note that this assumes the batch label is always exactly the last two characters:

library(stringr)
library(dplyr)

duplicates %>% 
    mutate(batch = str_sub(sample, -2,-1)) %>% 
    mutate(sample = str_sub(sample, end=-3))

Output:

   sample batch
1    a_1_    b1
2   a1_2_    b1
3 a1_c_1_    b2
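
A minimal base R sketch of why the fixed-width approach is fragile compared with the `sub()` versions. The second test value, `LEW_3_batch10`, is an assumption modelled on the asker's comments about their real data, and `substr()` stands in for `str_sub()` so that no package is needed:

```r
# Second value is an assumed example of a longer batch label.
x <- c("a_1_b1", "LEW_3_batch10")

# Fixed width: always takes the last two characters, whatever they are.
fixed_batch <- substr(x, nchar(x) - 1L, nchar(x))

# Pattern based: takes everything after the last underscore.
regex_batch <- sub(".*_", "", x)

print(fixed_batch)  # "b1" "10"
print(regex_batch)  # "b1" "batch10"
```

The pattern-based version follows the underscore rather than a character count, so it keeps working when the batch label grows.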
TarJae

Using `separate`:

library(tidyr)
separate(duplicates, sample, into = c("strain", "sample"), 
        sep = "(?<=_)(?=[^_]+$)") 

Output:

    strain sample
1    a_1_     b1
2   a1_2_     b1
3 a1_c_1_     b2

For splitting the other way:

separate(duplicates, sample, into = c("strain", "sample"), 
         sep = "(?<=[^_])(?=_[^_]+$)")

Output:

  strain sample
1    a_1    _b1
2   a1_2    _b1
3 a1_c_1    _b2
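
For readers who find lookarounds hard to follow, the same two splits can be sketched in base R by locating the last underscore and cutting on either side of it. The helper name `split_last_underscore` and its `side` argument are hypothetical, not part of tidyr, and the sketch assumes every string contains at least one underscore:

```r
# Hypothetical helper: split each string at its last underscore.
# side = "left"  keeps the underscore at the end of the first piece.
# side = "right" keeps the underscore at the start of the second piece.
split_last_underscore <- function(x, side = c("left", "right")) {
  side <- match.arg(side)
  # Position of the last underscore: an underscore followed only by
  # non-underscore characters up to the end of the string.
  pos <- as.integer(regexpr("_[^_]*$", x))
  # Last character that belongs to the first piece.
  end_first <- if (side == "left") pos else pos - 1L
  data.frame(
    strain = substr(x, 1L, end_first),
    sample = substr(x, end_first + 1L, nchar(x)),
    stringsAsFactors = FALSE
  )
}

x <- c("a_1_b1", "a1_2_b1", "a1_c_1_b2")
left_split  <- split_last_underscore(x, "left")   # a_1_ / b1, ...
right_split <- split_last_underscore(x, "right")  # a_1 / _b1, ...
```

Moving the cut point by one character is the whole difference between the two variants, which is what the two lookaround patterns express declaratively.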
akrun

You can use tidyr::extract with capture groups.

tidyr::extract(duplicates, sample, c("strain", "sample"), '(.*_)(\\w+)')

#   strain sample
#1    a_1_     b1
#2   a1_2_     b1
#3 a1_c_1_     b2

The same regex can also be used with `strcapture` in base R:

strcapture('(.*_)(\\w+)', duplicates$sample, 
           proto = list(strain = character(), sample = character()))
Ronak Shah
  • The first solution worked for me. The second created an empty column called sample and assigned the entire string to strain. How would I edit this to achieve the opposite split (a_1 & _b1)? – keenan Jul 28 '21 at 07:48
  • The second solution seems to work similarly for me for the data that you have shared. – Ronak Shah Jul 28 '21 at 07:50
  • My actual data is formatted slightly differently, so I am guessing that is why it isn't working. I realized the second column isn't actually empty. Using an actual column name rather than the simplified example I gave, it is separating as `LEW_3_batch0` and `1` (this is batch 10) or `LEW_2_batch` and `8` – keenan Jul 28 '21 at 07:55
  • How can I adapt the first solution to give me the opposite split? – keenan Jul 28 '21 at 07:58
  • Like this? `tidyr::extract(duplicates, sample, c("strain", "sample"), '(.*?)(_.*)')` – Ronak Shah Jul 28 '21 at 08:00
  • Thank you. Can you explain how this is causing the split to be on the last _ in the string? I don't fully understand what is happening. – keenan Jul 28 '21 at 08:04
  • `.*` is greedy, meaning it will match everything up to the last underscore. If you want to match only up to the first underscore, use `?`, i.e. `(.*?)`, as mentioned in the earlier comment. – Ronak Shah Jul 28 '21 at 10:08
  • In some cases there are 3 underscores and I still need to split on the last one, but keeping the underscore attached to the second column, like `a1_c_1 & _b2`. How would I go about this? – keenan Jul 28 '21 at 17:45
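
The greedy versus lazy behaviour discussed in these comments can be checked directly. The following is a small base R sketch, not taken from any of the answers, that uses `sub()` with `perl = TRUE` and the two capture-group patterns on the example strings:

```r
x <- c("a_1_b1", "a1_c_1_b2")

# Greedy (.*) takes as much as it can, so the second group
# starts at the LAST underscore.
greedy_first  <- sub("^(.*)(_.*)$", "\\1", x, perl = TRUE)
greedy_second <- sub("^(.*)(_.*)$", "\\2", x, perl = TRUE)

# Lazy (.*?) takes as little as it can, so the second group
# starts at the FIRST underscore.
lazy_first  <- sub("^(.*?)(_.*)$", "\\1", x, perl = TRUE)
lazy_second <- sub("^(.*?)(_.*)$", "\\2", x, perl = TRUE)

print(greedy_first)   # "a_1"    "a1_c_1"
print(greedy_second)  # "_b1"    "_b2"
print(lazy_first)     # "a"      "a1"
print(lazy_second)    # "_1_b1"  "_c_1_b2"
```

With the greedy pattern, a string with three underscores such as `a1_c_1_b2` is still split at its last underscore, with the underscore staying on the second piece.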