Removing fixed start of a string, and a variable ending

Question

I have been struggling to find an efficient strategy to solve this.

Given the following character vector: `strings <- c("PMLR01TR060055PB01", "PMLR01BE080001PD01")`

How can I remove the fixed start ("PMLR01") and the variable ending ("PB01" or "PD01"), to get `TR060055` and `BE080001`?

I have a huge number of entries (10000+) and would like an efficient strategy that handles all of them. Ideally, I need a way to remove everything before the TR or BE, and everything after the digits of the substring I want to keep. That way I would cover all possible cases.

I tried a very naive approach: `substr("PMLR01TR060055PB01", 7, 14)`. But if one of the strings doesn't have exactly the expected number of characters, this will give the wrong result.

`gsub("^PMLR01|[A-Z]{2}[0-9]{2}$", "", strings)` or `gsub(".*((TR|BE)[0-9]+)[^0-9].*", "\\1", strings)` work. See https://stackoverflow.com/a/22944075/3358272, and https://regexr.com/ and https://regex101.com/ for other regex ideas and explanations. – r2evans Jun 07 '23 at 12:22
@r2evans These definitely work. Regex certainly seems to be the way to go! But I'll confess, I am quite lost (and I looked a lot for a solution using regex). The only "limitation" with this approach is that it either limits the prefix to `PMLR01`, or the start of the string of interest to `TR|BE`. I'm looking for a solution that gets rid of the beginning (4 letters and 2 numbers) and the ending (a combination of 2 letters and 2 numbers, after the string to keep). – milcs40 Jun 07 '23 at 15:17
Actually, exploring your suggestions and the resources you sent, I think I've worked it out: `"^[A-Z]{4}[0-9]{2}|[A-Z]{2}[0-9]{2}$"`. This gets rid of the start (4 letters and 2 numbers) and the end (2 letters and 2 numbers). – milcs40 Jun 07 '23 at 15:28
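For reference, the pattern from the last comment can be checked against the sample data. The sketch below also tries a capture-group variant with `sub()`. The names `trimmed`, `pattern`, `captured` and `ok` are illustrative and not from the thread, and the layout (4 letters + 2 digits, then 2 letters + digits, then 2 letters + 2 digits) is assumed from the comments above.

```r
strings <- c("PMLR01TR060055PB01", "PMLR01BE080001PD01")

# Strip a 4-letter + 2-digit prefix and a 2-letter + 2-digit suffix.
trimmed <- gsub("^[A-Z]{4}[0-9]{2}|[A-Z]{2}[0-9]{2}$", "", strings)

# Alternative: describe the whole string and keep only the captured middle part.
pattern <- "^[A-Z]{4}[0-9]{2}([A-Z]{2}[0-9]+)[A-Z]{2}[0-9]{2}$"
captured <- sub(pattern, "\\1", strings)

# sub() returns non-matching strings unchanged, so flag entries
# that do not follow the assumed layout before trusting the result.
ok <- grepl(pattern, strings)
```

With 10000+ entries, inspecting `strings[!ok]` is one way to find the inputs that would otherwise pass through unmodified.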

Linus Henkel · Answer 1 · 2023-06-07T12:38:03.320


In your case, you could use the gsub() function, which replaces every match of a pattern in a string. The regex pattern to match is everything up to and including "TR" or "BE", plus everything that comes after the numbers.

Here is a way to do it (not tested):

strings <- c("PMLR01TR060055PB01", "PMLR01BE080001PD01")

result <- gsub("^.*?(TR|BE)|[A-Z][A-Z][0-9][0-9]$", "", strings)

print(result)
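Because the pattern above consumes the "TR" or "BE" itself, those two letters are removed along with the prefix. A possible variant, sketched here as an assumption rather than as the answerer's code, swaps the capturing group for a lookahead so that the marker is checked but not removed. Lookaheads are a PCRE feature, so this relies on `perl = TRUE`.

```r
strings <- c("PMLR01TR060055PB01", "PMLR01BE080001PD01")

# (?=TR|BE) is a lookahead: it requires TR or BE to follow without consuming it,
# so only the text before the marker is removed. Needs perl = TRUE.
result <- gsub("^.*?(?=TR|BE)|[A-Z]{2}[0-9]{2}$", "", strings, perl = TRUE)

print(result)
```

This still assumes that the part to keep starts with TR or BE, so it does not cover arbitrary two-letter codes.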

edited Jun 07 '23 at 12:38

answered Jun 07 '23 at 12:25

Linus Henkel


Thank you for your help. Unfortunately, using this approach, the results I get are `060055` and `080001`, but I need `TR060055` and `BE080001`. Moreover, I don't think I explained it well: the first two letters of the substring I want to keep can vary (not only `TR|BE`). – milcs40 Jun 07 '23 at 15:06

score 0 · Answer 2 · answered Jun 07 '23 at 13:41


If you do not want to use regex and are ok with multiple lines of code you can do:

strings <- c("PMLR01TR060055PB01", "PMLR01BE080001PD01")
strings <- gsub("PMLR01", "", strings)
strings <- gsub("PB01", "", strings)
strings <- gsub("PD01", "", strings)
strings
#[1] "TR060055" "BE080001"
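If the prefix and suffix lengths are fixed but their contents vary, a regex-free variant is to cut by position relative to each string's length. This is only a sketch: `prefix_len`, `suffix_len` and `middle` are illustrative names, and the lengths 6 and 4 are assumed from the sample strings.

```r
strings <- c("PMLR01TR060055PB01", "PMLR01BE080001PD01")

prefix_len <- 6  # assumed: 4 letters + 2 digits
suffix_len <- 4  # assumed: 2 letters + 2 digits

# substr() is vectorised, so nchar(strings) gives a per-element stop position
# and the middle part may have a different length in each string.
middle <- substr(strings, prefix_len + 1, nchar(strings) - suffix_len)
```

Unlike a hard-coded `substr(x, 7, 14)`, this adapts to the length of each string, but it does not check that the prefix and suffix actually have the expected shape.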

answered Jun 07 '23 at 13:41

user1317221_G


I could do it as suggested, but as I mentioned, I have more than 10000 strings, with different prefixes and suffixes. Moreover, the string I want to select can start with any combination of 2 letters. – milcs40 Jun 07 '23 at 15:09
OK, then I'd suggest making an example that better reflects your actual situation, so it's easier to solve. – user1317221_G Jun 08 '23 at 07:56
