I have a dataset that I'm trying to work with where I need to get the text between two pipe delimiters. The length of the text is variable so I can't use length to get it. This is the string:

ENST00000000233.10|ENSG00000004059.11|OTTHUMG000

I want to get the text between the first and second pipes, that is, ENSG00000004059.11. I've tried several different regular expressions, but I can't figure out the correct syntax. What should the regular expression be?

Wiktor Stribiżew
Ben Tanner

4 Answers

Here is a regex.

x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
sub("^[^\\|]*\\|([^\\|]+)\\|.*$", "\\1", x)
#> [1] "ENSG00000004059.11"

Created on 2022-05-03 by the reprex package (v2.0.1)

Explanation:

  • ^ beginning of string;
  • [^\\|]* not the pipe character zero or more times;
  • \\| the pipe character needs to be escaped since it's a meta-character;
  • ^[^\\|]*\\| the three parts above combined match everything from the beginning of the string up to and including the first pipe character;
  • ([^\\|]+) a capture group matching anything but the pipe character, at least once;
  • \\|.*$ the second pipe plus anything until the end of the string.

Then replace the entire match with the contents of the first (and only) capture group, "\\1", thus removing everything else.
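As a rough sketch of how the same sub() call behaves on more than one string: sub() is vectorised, and it returns the input unchanged when the pattern does not match. The vector ids and the names pattern, res and safe are made up for illustration; only the first element comes from the question. The pattern is written with [^|], since the pipe does not need escaping inside a character class.

```r
# Hypothetical inputs: only the first string is from the question.
ids <- c(
  "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000",
  "A|B|C|D",
  "no pipes here"
)

# Same idea as above; the pipe is literal inside [ ].
pattern <- "^[^|]*\\|([^|]+)\\|.*$"

# sub() works element-wise; non-matching strings come back unchanged.
res <- sub(pattern, "\\1", ids)
res
#> [1] "ENSG00000004059.11" "B"                  "no pipes here"

# If NA is preferred for strings without two pipes, guard with grepl().
safe <- ifelse(grepl(pattern, ids), res, NA_character_)
safe
#> [1] "ENSG00000004059.11" "B"                  NA
```

Whether an unchanged string or NA is the better result for malformed rows depends on the dataset, so the guard is optional.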

Rui Barradas

Another option is to get the second item after splitting the string on |.

x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"

strsplit(x, "\\|")[[1]][[2]]
# strsplit(x, "[|]")[[1]][[2]]

# [1] "ENSG00000004059.11"

Or with tidyverse:

library(tidyverse)

str_split(x, "\\|") %>% map_chr(`[`, 2)

# [1] "ENSG00000004059.11"
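For a whole column rather than a single string, the splitting approach might look like the sketch below. It assumes base R only; the vector ids and the names parts and second are hypothetical. fixed = TRUE treats the pipe as a literal character, so no escaping is needed.

```r
# Hypothetical inputs: only the first string is from the question.
ids <- c(
  "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000",
  "A|B|C|D",
  "no pipes here"
)

# Split every element on a literal "|" (no regex involved).
parts <- strsplit(ids, "|", fixed = TRUE)

# Take the second piece of each element; indexing past the end gives NA.
second <- vapply(parts, function(p) p[2], character(1))
second
#> [1] "ENSG00000004059.11" "B"                  NA
```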
AndrewGB

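You could use lookbehind and lookahead assertions to extract the text that is surrounded by two "|" characters.

The pattern matches one or more characters, as few as possible (.+?), that are preceded by a "|" ((?<=\\|)) and followed by a "|" ((?=\\|)). Since str_extract() returns only the first match, the result is the text between the first and second pipes.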

library(stringr)

x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
str_extract(x, "(?<=\\|).+?(?=\\|)")

#> [1] "ENSG00000004059.11"
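If stringr is not available, a similar lookaround pattern can probably be used in base R. This is a sketch, not the answer's own method: it relies on regexpr() with perl = TRUE (lookarounds need the PCRE engine) and regmatches(), and it swaps .+? for [^|]+ so the match cannot cross a pipe. The name m is just illustrative.

```r
x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"

# regexpr() finds the first match only; perl = TRUE enables lookarounds.
m <- regexpr("(?<=\\|)[^|]+(?=\\|)", x, perl = TRUE)

# regmatches() pulls out the matched substring.
regmatches(x, m)
#> [1] "ENSG00000004059.11"
```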
benson23

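Try this: \|.*\| or, in R, \\|.*\\|, since the backslash itself has to be escaped inside an R string. (It is an escaped pipe, followed by any character (.) repeated any number of times (*), followed by another escaped pipe.) Note that .* is greedy, so if the string contains more than two pipes the match runs to the last one; use .*? to stop at the second pipe.

Then wrap the extracted match in str_sub(), with start 2 and end -2, to get rid of the pipes if you don't want them, e.g. str_sub(str_extract(MyString, "\\|.*?\\|"), 2, -2).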
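To illustrate how the greedy and lazy quantifiers differ here, the sketch below uses a made-up string y with three pipes and base R functions in place of stringr. The names greedy, lazy and inner are hypothetical.

```r
# Hypothetical string with more than two pipes.
y <- "a|b|c|d"

# Greedy .* runs to the LAST pipe.
greedy <- regmatches(y, regexpr("\\|.*\\|", y))
greedy
#> [1] "|b|c|"

# Lazy .*? stops at the next pipe.
lazy <- regmatches(y, regexpr("\\|.*?\\|", y, perl = TRUE))
lazy
#> [1] "|b|"

# Drop the surrounding pipes (base R counterpart of str_sub(., 2, -2)).
inner <- substr(lazy, 2, nchar(lazy) - 1)
inner
#> [1] "b"
```

On the two-pipe string from the question both variants give the same result, so the difference only matters if the real data has further fields.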

alexrai93