11

I have a vector of strings, each of which looks like this:

ABC_EFG_HIG_ADF_AKF_MNB

From each element I want to extract the third underscore-separated substring (counting from the left), i.e. HIG in this case. How can I achieve this in R?

zx8754
Rajarshi Bhadra

5 Answers

18

substr extracts a substring by position:

substr('ABC_EFG_HIG_ADF_AKF_MNB', 9, 11)

returns

[1] "HIG"
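Since the question is about a whole vector, it may help to see that substr is vectorized over its first argument. The sketch below assumes every element has fields of the same width; the second string is a made-up example, not data from the question.

```r
# Hypothetical input: only the first element comes from the question
x <- c("ABC_EFG_HIG_ADF_AKF_MNB", "QRS_TUV_WXY_ZAB_CDE_FGH")

# substr() works on every element of the vector at once
third <- substr(x, 9, 11)
print(third)
#[1] "HIG" "WXY"

# Caveat: the positions shift as soon as a field has a different width
print(substr("AB_EFG_HIG", 9, 11))
#[1] "IG"
```

If the field widths can vary, a delimiter-based approach is the safer choice.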
alistaire
10

Here's one more possibility:

strsplit(str1,"_")[[1]][3]
#[1] "HIG"

The command strsplit() does what its name suggests: it splits a string. The second parameter is the character on which the string is split, wherever it is found within the string.

Perhaps somewhat surprisingly, strsplit() returns a list. So we can either use unlist() to access the resulting split parts of the original string, or in this case address them with the index of the list [[1]] since the list in this example has only one member, which consists of six character strings (cf. the output of str(strsplit(str1,"_"))). To access the third entry of this list, we can specify [3] at the end of the command.

The string str1 is defined here as in the answer by @akrun.
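The [[1]] index only addresses the first string, so for a vector with several elements one possible extension is to pick the third piece out of each list member. This is a minimal sketch; the second string in x is an invented example.

```r
# Hypothetical input: only the first element comes from the question
x <- c("ABC_EFG_HIG_ADF_AKF_MNB", "QRS_TUV_WXY_ZAB_CDE_FGH")

# strsplit() returns one list member per input string;
# fixed = TRUE treats "_" as a literal character rather than a regex
parts <- strsplit(x, "_", fixed = TRUE)

# Take the third piece of each member; vapply() guarantees a character vector.
# Strings with fewer than three pieces yield NA.
third <- vapply(parts, function(p) p[3], character(1))
print(third)
#[1] "HIG" "WXY"
```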

RHertel
    Was about to post the same, but slightly different: `strsplit(str1,"_")[[c(1,3)]]`, just to show what a vector does inside `[[`. – nicola Mar 02 '16 at 17:30
7

We can use sub. The pattern matches one or more characters that are not _ ([^_]+) followed by a _, wrapped in a capture group. Because we want the third set of non-_ characters, we repeat that group twice ({2}), then add a second capture group of one or more non-_ characters, followed by .* to match the rest of the string. In the replacement, we use the backreference to the second capture group (\\2).

sub("^([^_]+_){2}([^_]+).*", "\\2", str1)
#[1] "HIG"
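The same pattern can be generalized to any position by building it with sprintf. The helper name nth_field below is hypothetical (it is not part of base R), and the sketch assumes the separator is a single character with no special meaning inside a regex character class.

```r
# Hypothetical helper: extract the n-th separator-delimited field with sub()
nth_field <- function(x, n, sep = "_") {
  # Skip (n - 1) "field + separator" groups, then capture the next field
  pattern <- sprintf("^([^%s]+%s){%d}([^%s]+).*", sep, sep, as.integer(n) - 1L, sep)
  # sub() is vectorized over x; strings that do not match are returned unchanged
  sub(pattern, "\\2", x, perl = TRUE)
}

# Hypothetical input: only the first element comes from the question
x <- c("ABC_EFG_HIG_ADF_AKF_MNB", "QRS_TUV_WXY_ZAB_CDE_FGH")
print(nth_field(x, 3))
#[1] "HIG" "WXY"
print(nth_field(x, 6))
#[1] "MNB" "FGH"
```

Because non-matching strings come back unchanged, it is worth checking that every element really contains at least n fields.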

Or another option is with scan

scan(text=str1, sep="_", what="", quiet=TRUE)[3]
#[1] "HIG"

A similar option as mentioned by @RHertel would be to use read.table/read.csv on the string

 read.table(text=str1,sep = "_", stringsAsFactors=FALSE)[,3]
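The read.table approach also handles a whole vector, because each element of text is read as one line. The sketch below assumes all strings have the same number of fields; colClasses = "character" is added here to stop read.table from converting fields to other types, and the second string is an invented example.

```r
# Hypothetical input: only the first element comes from the question
x <- c("ABC_EFG_HIG_ADF_AKF_MNB", "QRS_TUV_WXY_ZAB_CDE_FGH")

# Each vector element becomes one row; each "_"-separated field becomes a column
fields <- read.table(text = x, sep = "_", colClasses = "character")

# The third column holds the third field of every string
third <- fields[, 3]
print(third)
#[1] "HIG" "WXY"
```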

data

str1 <- "ABC_EFG_HIG_ADF_AKF_MNB"
akrun
6

If you know the position of the substring you are looking for, and you know that it is fixed (here, characters 9 through 11), you can simply use str_sub() from the stringr package.

library(stringr)  # provides str_sub()
MyString = 'ABC_EFG_HIG_ADF_AKF_MNB'
str_sub(MyString, 9, 11)
Rtist
2

A newer option is the function str_split_i from the development version of stringr, which splits a string on a given pattern and returns the piece at a given position. Here is a reproducible example:

# devtools::install_github("tidyverse/stringr")
library(stringr)
x <- c("ABC_EFG_HIG_ADF_AKF_MNB")
str_split_i(x, "_", 3)
#> [1] "HIG"

Created on 2022-09-10 with reprex v2.0.2

As you can see, it extracted the third piece. If you want the sixth, replace the 3 with 6 like this:

library(stringr)
x <- c("ABC_EFG_HIG_ADF_AKF_MNB")
str_split_i(x, "_", 6)
#> [1] "MNB"


Quinten