I have a vector of strings that look like this:
ABC_EFG_HIG_ADF_AKF_MNB
From each element I want to extract the third underscore-separated piece (counting from the left), i.e. HIG in this case. How can I achieve this in R?
substr extracts a substring by position:
substr('ABC_EFG_HIG_ADF_AKF_MNB', 9, 11)
returns
[1] "HIG"
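Since the question is about a whole vector, it may help to note that substr is vectorised over its first argument. The sketch below uses a made-up second element (not from the question) to show that the fixed positions 9 to 11 only give the third piece when every string has the same layout:

```r
# Hypothetical input: the second element is invented to show uneven field widths
x <- c("ABC_EFG_HIG_ADF_AKF_MNB", "A_BC_DEF_GH")

# substr() applies the same start/stop positions to every element
by_position <- substr(x, 9, 11)
print(by_position)
# gives "HIG" for the first element and "_GH" for the second
```

The third piece of the second element is DEF, so extraction by position is only safe when all pieces have fixed widths.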
Here's one more possibility:
strsplit(str1,"_")[[1]][3]
#[1] "HIG"
The command strsplit() does what its name suggests: it splits a string. The second parameter is the delimiter on which the string is split, wherever it occurs within the string (by default it is interpreted as a regular expression).
Perhaps somewhat surprisingly, strsplit() returns a list. So we can either use unlist() to access the parts of the split string or, as in this case, address them with the list index [[1]], since the list in this example has only one member, a character vector of six strings (cf. the output of str(strsplit(str1,"_"))).
To access the third entry of this vector, we can append [3] to the command.
The string str1 is defined here as in the answer by @akrun.
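The [[1]] index only looks at the first element, whereas the question asks about a vector. A minimal sketch of one way to extend this to every element follows; the vector x and its second and third elements are invented for illustration:

```r
# Hypothetical input vector; only the first element comes from the question
x <- c("ABC_EFG_HIG_ADF_AKF_MNB", "A_BC_DEF_GH", "ONE_TWO")

# fixed = TRUE treats "_" as a literal string rather than a regular expression
parts <- strsplit(x, "_", fixed = TRUE)

# Take the third piece of each element; indexing past the end yields NA
third <- vapply(parts, function(p) p[3], character(1))
print(third)
```

vapply is used here instead of sapply so that the result is guaranteed to be a character vector, with NA for strings that have fewer than three pieces.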
We can use sub. We match one or more characters that are not _ ([^_]+) followed by a _, and keep that in a capture group. As we want to extract the third set of non-_ characters, we repeat that group two times ({2}), followed by a second capture group of one or more non-_ characters, and then the rest of the string, matched by .*. In the replacement, we use the backreference to the second capture group (\\2).
sub("^([^_]+_){2}([^_]+).*", "\\2", str1)
#[1] "HIG"
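One thing to keep in mind with this approach is that sub returns its input unchanged when the pattern does not match. The sketch below, using an invented two-piece string alongside the one from the question, shows one possible way to turn such cases into NA:

```r
# Hypothetical input: the second element has fewer than three pieces
x <- c("ABC_EFG_HIG_ADF_AKF_MNB", "ONE_TWO")
pattern <- "^([^_]+_){2}([^_]+).*"

# Without a match, sub() leaves the string as it is
replaced <- sub(pattern, "\\2", x)
print(replaced)
# gives "HIG" for the first element and "ONE_TWO" for the second

# Keep the replacement only where the pattern actually matched
third <- ifelse(grepl(pattern, x), replaced, NA_character_)
print(third)
```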
Another option is scan:
scan(text=str1, sep="_", what="", quiet=TRUE)[3]
#[1] "HIG"
A similar option, as mentioned by @RHertel, would be to use read.table/read.csv on the string:
read.table(text=str1,sep = "_", stringsAsFactors=FALSE)[,3]
str1 <- "ABC_EFG_HIG_ADF_AKF_MNB"
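The read.table approach also accepts a whole character vector through text, treating each element as one row. A small sketch, assuming every string has the same number of pieces (the second element is invented for illustration):

```r
# Hypothetical input: both elements have six underscore-separated pieces
x <- c("ABC_EFG_HIG_ADF_AKF_MNB", "QRS_TUV_WXY_ZAB_CDE_FGH")

# colClasses = "character" stops read.table from converting pieces
# that happen to look like numbers or logicals
fields <- read.table(text = x, sep = "_", colClasses = "character")

# Column 3 holds the third piece of every string
print(fields[, 3])
# gives "HIG" and "WXY"
```

If the strings have differing numbers of pieces, read.table will complain unless fill = TRUE is supplied, so the strsplit or sub approaches may be the safer choice for ragged input.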
If you know the position of the substring you are looking for, and you know that it is fixed (here, from the 9th to the 11th character), you can simply use str_sub() from the stringr package.
library(stringr)
MyString <- 'ABC_EFG_HIG_ADF_AKF_MNB'
# extract characters 9 through 11
str_sub(MyString, 9, 11)
A newer option is the function str_split_i from stringr, which splits a string on a given pattern and returns the piece at a given position. At the time of writing it was only available in the development version; it was later released in stringr 1.5.0. Here is a reproducible example:
# devtools::install_github("tidyverse/stringr")
library(stringr)
x <- c("ABC_EFG_HIG_ADF_AKF_MNB")
str_split_i(x, "_", 3)
#> [1] "HIG"
Created on 2022-09-10 with reprex v2.0.2
As you can see, it extracted the third piece. If you want the sixth, replace the 3 with 6 like this:
library(stringr)
x <- c("ABC_EFG_HIG_ADF_AKF_MNB")
str_split_i(x, "_", 6)
#> [1] "MNB"