10

I have a column whose values each consist of 3 strings separated by semicolons. I need to extract just the part of the string that comes before the first semicolon.

Type <- c("SNSR_RMIN_PSX150Y_CSH;SP_12;I0.00V50HX0HY3000")

What I want is to get the first part of the string (up to the first semicolon).

Desired output : SNSR_RMIN_PSX150Y_CSH

I tried gsub without success.

Henrik
Sharath

4 Answers

14

You could try sub

sub(';.*$','', Type)
#[1] "SNSR_RMIN_PSX150Y_CSH"

The pattern matches everything from the first occurrence of ; to the end of the string, and sub replaces that match with an empty string ('').

Or use

library(stringi)
stri_extract(Type, regex='[^;]*')
#[1] "SNSR_RMIN_PSX150Y_CSH"
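
Since the question mentions a column, here is a minimal sketch of applying the same sub call to a data frame column. The data frame `df`, the column name `First`, and the second row are hypothetical and only serve as illustration; sub is vectorised, so no loop is needed.

```r
# Hypothetical data frame; only the format of the first value comes from the question.
df <- data.frame(
  Type = c("SNSR_RMIN_PSX150Y_CSH;SP_12;I0.00V50HX0HY3000",
           "NO_SEMICOLON_HERE"),
  stringsAsFactors = FALSE
)

# Drop everything from the first semicolon onwards, for every row at once.
df$First <- sub(";.*$", "", df$Type)
df$First
#[1] "SNSR_RMIN_PSX150Y_CSH" "NO_SEMICOLON_HERE"
```

A value without any semicolon is returned unchanged, because sub leaves strings alone when the pattern does not match.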
akrun
  • Thanks. Works fast even with large data sets. – Sharath Apr 20 '15 at 15:49
  • @Sharath No problem. I think `stringi` should be faster. I've updated the answer with one option; please check whether it is faster. – akrun Apr 20 '15 at 15:51
  • I think your regex in `stri_extract` should be `'^[^;]*'`, to make it explicit that you want the first set of characters before the `;`... – StevieP Apr 20 '15 at 21:34
  • @StevieP Thanks, I thought about changing it, but then TylerRinker also posted a solution with that, so I left it like that. But, I am not sure whether this can fail in any situations. – akrun Apr 21 '15 at 05:17
  • @akrun Could you explain the difference between the * and + symbols in your regex? You used * but the answer below uses +. – user3594490 Feb 11 '20 at 23:03
  • @user3594490 the `*` is for zero or more occurrences and `+` for one or more occurrences – akrun Feb 11 '20 at 23:05
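
To make the `*` versus `+` and the `^` anchor discussion in the comments concrete, here is a small sketch using base R's regexpr and regmatches rather than stringi. The helper `first_match` and the input `x` (a value whose first field is empty) are made up for illustration; for patterns this simple the base R and stringi regex engines should behave the same, but that is an assumption rather than something tested here.

```r
# Hypothetical input: the first field is empty, so the string starts with ";".
x <- ";SP_12;I0.00V50HX0HY3000"

# Hypothetical helper: return the first match of `pattern` in a single string,
# or NA if there is no match at all.
first_match <- function(pattern, x) {
  m <- regexpr(pattern, x)
  if (m == -1L) NA_character_ else regmatches(x, m)
}

first_match("^[^;]*", x)  # ""      zero characters before the first ";" is a valid match
first_match("^[^;]+", x)  # NA      at least one character is required at the start
first_match("[^;]+", x)   # "SP_12" without ^ the match slides to the second field
first_match("[^;]*", x)   # ""      an empty match at the start is found first
```

So the unanchored `+` form is the one that can silently return the wrong field when the first field is empty; the other three variants return either an empty string or NA.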
9

The stringi package works very fast here:

library(stringi)
stri_extract_first_regex(Type, "^[^;]+")
## [1] "SNSR_RMIN_PSX150Y_CSH"

I benchmarked the 3 main approaches here:

Unit: milliseconds
      expr       min        lq      mean   median        uq      max neval
  SAPPLY() 254.88442 267.79469 294.12715 277.4518 325.91576 419.6435   100
     SUB() 182.64996 186.26583 192.99277 188.6128 197.17154 237.9886   100
 STRINGI()  89.45826  91.05954  94.11195  91.9424  94.58421 124.4689   100

Here's the code for the benchmarks:

library(stringi)
SAPPLY <- function() sapply(strsplit(Type, ";"), "[[", 1)
SUB <- function() sub(';.*$','', Type)
STRINGI <- function() stri_extract_first_regex(Type, "^[^;]+")

Type <- c("SNSR_RMIN_PSX150Y_CSH;SP_12;I0.00V50HX0HY3000")
Type <- rep(Type, 100000)

library(microbenchmark)
microbenchmark( 
    SAPPLY(),
    SUB(),
    STRINGI(),
times=100L)
Tyler Rinker
  • I missed akrun's edit (about the same as my approach). I'll leave this up for the benchmarks. – Tyler Rinker Apr 20 '15 at 16:08
  • Thanks Tyler. The 'stringi' library is fast and I am going to use that instead of sub. – Sharath Apr 20 '15 at 16:13
  • Would it make a difference to use `stri_extract_first_regex` vs `stri_extract` with the regex specified inside? – akrun Apr 20 '15 at 16:14
  • I am not sure but the documentation explicitly states: "stri_extract, stri_extract_all, stri_extract_first, and stri_extract_last are convenience functions. They just call stri_extract_*_*, depending on arguments used. Unless you are a very lazy person, please call the underlying functions directly for better performance." – Tyler Rinker Apr 20 '15 at 16:21
3

You can also use strsplit:

strsplit(Type, ";")[[1]][1]
#[1] "SNSR_RMIN_PSX150Y_CSH"
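
Note that `[[1]][1]` picks the first field of the first element only. For a whole column, a sketch along the following lines may be closer to what is needed; the vector `x` and its second value are hypothetical, and `fixed = TRUE` is an optional choice here to treat ";" as a literal string instead of a regex.

```r
# Hypothetical vector with more than one value.
x <- c("SNSR_RMIN_PSX150Y_CSH;SP_12;I0.00V50HX0HY3000", "A;B;C")

# Split every element on ";" and take the first piece of each.
# vapply checks that each result is a single character string.
first_field <- vapply(strsplit(x, ";", fixed = TRUE), `[`, character(1), 1)
first_field
#[1] "SNSR_RMIN_PSX150Y_CSH" "A"
```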
Mamoun Benghezal
1

When performance is important, you can use substr in combination with regexpr from base R.

substr(Type, 1, regexpr(";", Type, fixed=TRUE)-1)
#[1] "SNSR_RMIN_PSX150Y_CSH"
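
One caveat with this approach: regexpr returns -1 when there is no semicolon, so the substr call above yields an empty string for such values. If that can happen in your data, a guarded variant might look like the sketch below. The sample vector `x` and the variable names `pos` and `end` are made up for illustration.

```r
# Hypothetical inputs: a normal value, one without ";", and one starting with ";".
x <- c("SNSR_RMIN_PSX150Y_CSH;SP_12", "NO_SEMICOLON", ";EMPTY_FIRST")

# Position of the first ";" (-1 if absent); as.integer drops the match attributes.
pos <- as.integer(regexpr(";", x, fixed = TRUE))

# Cut just before the ";" when there is one, otherwise keep the whole string.
end <- ifelse(pos > 0L, pos - 1L, nchar(x))
substr(x, 1, end)
#[1] "SNSR_RMIN_PSX150Y_CSH" "NO_SEMICOLON"          ""
```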

Timings: (Reusing the part from @tyler-rinker)

library(stringi)
SAPPLY <- function() sapply(strsplit(Type, ";"), "[[", 1)
SUB <- function() sub(';.*$','', Type)
SUB2 <- function() sub(';.*','', Type)
SUB3 <- function() sub('([^;]*).*','\\1', Type)
STRINGI <- function() stri_extract_first_regex(Type, "^[^;]+")
STRINGI2 <- function() stri_extract_first_regex(Type, "[^;]*")
SUBSTRREG <- function() substr(Type, 1, regexpr(";", Type)-1)
SUBSTRREG2 <- function() substr(Type, 1, regexpr(";", Type, fixed=TRUE)-1)
SUBSTRREG3 <- function() substr(Type, 1, regexpr(";", Type, fixed=TRUE, useBytes = TRUE)-1)

Type <- c("SNSR_RMIN_PSX150Y_CSH;SP_12;I0.00V50HX0HY3000")
Type <- rep(Type, 100000)

library(microbenchmark)
microbenchmark(SAPPLY(), SUB(), SUB2(), SUB3(), STRINGI()
 , STRINGI2(), SUBSTRREG(), SUBSTRREG2(), SUBSTRREG3())
#Unit: milliseconds
#         expr       min        lq      mean    median        uq       max neval
#     SAPPLY() 382.23750 395.92841 412.82508 410.05236 427.58816 460.28508   100
#        SUB() 111.92120 114.28939 116.41950 115.57371 118.15573 123.92400   100
#       SUB2()  94.27831  96.50462  98.14741  97.38199  99.15260 119.51090   100
#       SUB3() 167.77139 172.51271 175.07144 173.83121 176.27710 190.97815   100
#    STRINGI()  38.27645  39.33428  39.94134  39.71842  40.50182  42.55838   100
#   STRINGI2()  38.16736  39.19250  40.14904  39.63929  40.37686  56.03174   100
#  SUBSTRREG()  45.04828  46.39867  47.13018  46.85465  47.71985  51.07955   100
# SUBSTRREG2()  10.67439  11.02963  11.29290  11.12222  11.43964  13.64643   100
# SUBSTRREG3()  10.74220  10.95139  11.39466  11.06632  11.46908  27.72654   100
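
Timings like these are only comparable if the candidates return the same result. A quick sanity check, sketched here for the base R variants on a small sample (the second value in `x` is made up), could be run before benchmarking:

```r
# Small sample in the question's format; the second value is a hypothetical extra row.
x <- c("SNSR_RMIN_PSX150Y_CSH;SP_12;I0.00V50HX0HY3000", "A;B;C")

res_sub    <- sub(";.*$", "", x)
res_substr <- substr(x, 1, regexpr(";", x, fixed = TRUE) - 1)
res_split  <- sapply(strsplit(x, ";"), "[[", 1)

# Stop with an error if any approach disagrees with the others.
stopifnot(identical(res_sub, res_substr), identical(res_sub, res_split))
res_sub
#[1] "SNSR_RMIN_PSX150Y_CSH" "A"
```

The approaches agree on inputs like these; they can differ on values without a semicolon or with an empty first field, which the benchmark data does not contain.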
GKi