
I have a vector of data in the form ‘aaa_9999_1’ where the first part is an alpha-location code, the second is the four digit year, and the final is a unique point identifier. E.g., there are multiple sil_2007_X points, each with a different last digit. I need to split this field, using the “_” character and save only the unique ID number into a new vector. I tried:

oss$point <- unlist(strsplit(oss$id, split='_', fixed=TRUE))[3]

based on a response here: R remove part of string. I get a single response of “1”. If I just run

strsplit(oss$id, split='_', fixed=TRUE)

I can generate the split list:

> head(oss$point)
[[1]]
[1] "sil"  "2007" "1"   

[[2]]
[1] "sil"  "2007" "2"   

[[3]]
[1] "sil"  "2007" "3"   

[[4]]
[1] "sil"  "2007" "4"   

[[5]]
[1] "sil"  "2007" "5"   

[[6]]
[1] "sil"  "2007" "6"  

Adding the [3] at the end just gives me the [[3]] result: “sil” “2007” “3”. What I want is a vector of the 3rd part (the unique number) of all records. I feel like I’m close to understanding this, but it is taking too much time (like most of a day) on a deadline project. Thanks for any feedback.

A.Birdman

3 Answers


strsplit creates a list, so I would try the following:

lapply(strsplit(oss$id, split='_', fixed=TRUE), `[`, 3) ## Output a list
sapply(strsplit(oss$id, split='_', fixed=TRUE), `[`, 3) ## Output a vector (even though a list is also a vector)

Here `[` is the extraction function itself, and 3 is passed to it as an argument, so the third element of each list component is returned. If you prefer a vector, substitute lapply with sapply.

Here's an example:

mystring <- c("A_B_C", "D_E_F")

lapply(strsplit(mystring, "_"), `[`, 3)
# [[1]]
# [1] "C"
# 
# [[2]]
# [1] "F"
sapply(strsplit(mystring, "_"), `[`, 3)
# [1] "C" "F"
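
To make the role of `[` more explicit, here is a minimal sketch that writes the same extraction with an anonymous function and then with vapply. It assumes every string has at least three parts; the variable names (parts, a, b, ids) are only illustrative.

```r
mystring <- c("A_B_C", "D_E_F")
parts <- strsplit(mystring, "_", fixed = TRUE)

# `[`(p, 3) is the same as p[3], so these two calls are equivalent
a <- sapply(parts, `[`, 3)
b <- sapply(parts, function(p) p[3])

# vapply() additionally checks that each result is exactly one string
ids <- vapply(parts, function(p) p[3], character(1))
```

vapply is a reasonable choice when the result must be a character vector, since it raises an error instead of silently returning a list if a piece has an unexpected shape.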

If there is an easily definable pattern, gsub might be a good option too, and avoids splitting. See the comments for improved (more robust) versions along the same lines from DWin and Josh O'Brien.

gsub(".*_.*_(.*)", "\\1", mystring)
# [1] "C" "F"

And, finally, just for fun, you can expand on the unlist approach to make it work by recycling a vector of TRUEs and FALSEs to extract every third item (since we know in advance that all the splits will result in an identical structure).

unlist(strsplit(mystring, "_"), use.names = FALSE)[c(FALSE, FALSE, TRUE)]
# [1] "C" "F"
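
As a rough sketch of why the original unlist(...)[3] attempt returned a single value: unlist flattens all the pieces into one long vector, so [3] picks just the third item overall. Assuming every string splits into exactly three parts, the flat vector can also be reshaped into a matrix and the third column taken (flat, first_only, m and third are illustrative names).

```r
mystring <- c("A_B_C", "D_E_F")

# Flatten all pieces into one vector: "A" "B" "C" "D" "E" "F"
flat <- unlist(strsplit(mystring, "_", fixed = TRUE), use.names = FALSE)

# Indexing with [3] selects only the third item of the flat vector
first_only <- flat[3]

# One row per input string, one column per part (assumes exactly 3 parts each)
m <- matrix(flat, ncol = 3, byrow = TRUE)
third <- m[, 3]
```

Both this and the logical-recycling version depend on every string having the same number of parts; the sapply approach does not.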

If you're extracting not by numeric position, but just looking to extract the last value after a delimiter, you have a few different alternatives.

Use a greedy regex:

gsub(".*_(.*)", "\\1", mystring)
# [1] "C" "F"

Use a convenience function like stri_extract* from the "stringi" package:

library(stringi)
stri_extract_last_regex(mystring, "[A-Z]+")
# [1] "C" "F"
A5C1D2H2I1M1N2O1R2T1
  • I like `gsub()` here, and might just do `gsub(".*_.*_", "", mystring)` or even (because regex matching is by default greedy) `gsub(".*_", "", mystring)` – Josh O'Brien Oct 16 '13 at 18:22
  • I would imagine that adding "^" to the beginning of that pattern would ensure that you get the third item rather than the last of many items. Regex pattern interpretations are greedy. `mystring <- c("A_B_C_D_E_F"); gsub(".*_.*_(.*)", "\\1", mystring)` ... returns `[1] "F"` – IRTFM Oct 16 '13 at 18:23
  • This is the pattern I found ensured the third (non-"_") item: `"^[^_]+_[^_]+_([^_]+)_.*"` – IRTFM Oct 16 '13 at 18:27
  • @DWin, I definitely agree with you that a safer approach should be taken or that the OP should make sure they understand what they are doing if they are going the `gsub` route, but judging by their description and their sample output from `strsplit`, the pattern is pretty predictable (in which case, I like Josh's comment-answer better than mine). Thanks for the alternatives :) – A5C1D2H2I1M1N2O1R2T1 Oct 16 '13 at 18:29
  • @DWin -- Neat. Alternatively, we could use `?` to make pieces of the regex non-greedy, like this: `gsub(".*?_.*?_(.*?)_.*", "\\1", mystring)`, and something like this will work to get the 5th element: `gsub("(.*?_){4}(.*?)_.*", "\\2", mystring)`. – Josh O'Brien Oct 16 '13 at 18:59
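
The difference discussed in these comments can be sketched as follows. The anchored pattern below is a slight variation on the one in the comments, not taken from them: it drops the required trailing "_" so that strings with exactly three parts also match. The names greedy and anchored are illustrative.

```r
mystring <- c("A_B_C", "A_B_C_D_E_F")

# Greedy: the leading .* consumes as much as possible, so the captured
# group ends up being the LAST piece of each string
greedy <- sub(".*_.*_(.*)", "\\1", mystring)

# Anchored with [^_]*: each piece cannot cross an underscore, so the
# captured group is always the THIRD piece; the trailing .* drops the rest
anchored <- sub("^[^_]*_[^_]*_([^_]*).*$", "\\1", mystring)
```

For ids shaped like 'sil_2007_1' the two patterns give the same answer; they only differ once a string has more than three parts.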

Is this what you need?

x = c('aaa_9999_12', 'bbb_9999_20')
ids = sapply(x, function(v){strsplit(v, '_')[[1]][3]}, USE.NAMES = FALSE)

# optional
# ids = as.numeric(ids)

This is VERY inefficient; there's probably a better way.

Fernando

Since stringr 1.5.0, str_split_i is available. This function allows one to access the ith element of a string split.

library(stringr)  # str_split_i() requires stringr >= 1.5.0
x <- c('aaa_9999_12', 'bbb_9999_20')
str_split_i(x, '_', 3)
#[1] "12" "20"