R - Find & Extract string from right

Question

I have this vector:

data <- structure(1:5, .Label = c("AVE_prQD_AFR_p10", "PER_prVD_DSR_p9", "PA_prSX_AR_p8", 
"prAV_AES_p7", "prGR_AXXR_p6", "prQW_AWAR_p5"), class = "factor")

  V1
1 "AVE_prQD_AFR_p10"
2 "PER_prVD_DSR_p9"
3 "PA_prSX_AR_p8"
4 "prAV_AES_p7"
5 "prGR_AXXR_p6"

I'm trying to extract the last characters on the right, specifically from the last _ to the end of the string. I don't know how many _ are in each string, but I do know that there will always be at least one, and that there will always be a _ right before the part of the string I need. To give you an example:

"any_random_string_0" # "0" is the string I need
"any_random_string_f20" # "f20" is the string I need
"any_random_string_p3" # "p3" is the string I need

As you can infer from the example above, the last part of the string always starts with _, followed by a p, an f or a 0, and ends with a number from 1 to 99 (except if it's 0):

"_" + "f" or "p" or "0" + "1" to "99"

There is NEVER gonna be anything after the number. Hence, the full string ends with the string I need. So, looking for a solution, I was trying to find (unsuccessfully) some function that searches for _ from the right.

Plus, I need to transform that string given these conditions:

If the string has p, multiply the number by -1
If the string has f, the number is positive
If the string is _0, give it 0.

This is my attempt. It works, but only with a fixed position of _ and with a number from 0 to 9.

function(some_vector_string){
  result <- stringr::str_sub(some_vector_string, -2,-2) %>% 
    {ifelse(. == "p",
            as.numeric(stringr::str_sub(some_vector_string, -1,-1))*-1,
            ifelse(. == "f",
                   as.numeric(stringr::str_sub(some_vector_string, -1,-1))*1,
                   ifelse(.=="_", 0, -100)))}
  return(result)
}
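
To see why fixed positions break down once the number has two digits, here is a small base R sketch. It uses one of the strings from the vector above; the variable names are only illustrative.

```r
x <- "AVE_prQD_AFR_p10"
n <- nchar(x)

# Second-to-last character: the attempt expects the letter here,
# but with a two-digit number it is the first digit.
second_last <- substr(x, n - 1, n - 1)

# Last character: only the final digit, not the whole number.
last <- substr(x, n, n)

second_last  # "1" rather than "p"
last         # "0" rather than "10"
```

So the letter test fails and the number is truncated, which is why an approach anchored on the last _ is needed.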

Yep, pretty sure this has been answered here before a few times. I'll hunt for the duplicate, but I think you can just destroy everything up to the last `_` like `sub(".+_", "", data)` — thelatemail, May 19 '21 at 02:54
I hope you find the duplicate thelatemail because I found none xD. @LMc didn't come to my mind. I will try that one. — Chris, May 19 '21 at 02:59
For the first part of the question maybe - https://stackoverflow.com/questions/31774086/extracting-text-after-last-period-in-string or https://stackoverflow.com/questions/42943533/r-get-last-element-from-str-split or https://stackoverflow.com/questions/37051288/extract-text-after-a-symbol-in-r — thelatemail, May 19 '21 at 03:08

LMc · Accepted Answer · 2021-05-20T14:50:55.493

In base R something like:

a <- sapply(strsplit(as.character(data), "_"), function(x) rev(x)[1])

ifelse(startsWith(a, "0"), 0, c(1, -1)[startsWith(a, "p") + 1] * as.numeric(gsub("\\D*", "", a)))
[1] -10  -9  -8  -7  -6
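
The sample vector only contains p suffixes, so it does not exercise the f and 0 cases. As a rough check, the sketch below wraps the same base R idea in a helper called suffix_to_number (a hypothetical name, not part of the original answer) and runs it on made-up strings that are assumed to follow the pattern described in the question.

```r
# Hypothetical helper: convert the part after the last "_" to a signed number.
suffix_to_number <- function(x) {
  # Keep only the last "_"-separated piece of each string.
  a <- sapply(strsplit(as.character(x), "_"), function(p) rev(p)[1])
  # Strip non-digits and convert what is left to numeric.
  num <- as.numeric(gsub("\\D+", "", a))
  # "0" gives zero, "p" gives a negative number, anything else (here "f") stays positive.
  ifelse(startsWith(a, "0"), 0, ifelse(startsWith(a, "p"), -num, num))
}

# Illustrative inputs covering all three suffix types.
test_strings <- c("AVE_prQD_AFR_p10", "any_random_string_f20", "any_random_string_0")
res <- suffix_to_number(test_strings)
res  # -10 20 0
```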

Or using tidyverse libraries:

library(readr)
library(purrr)
library(stringr)

n <- setNames(c(0, -1, 1), c("0", "p", "f"))

map_dbl(str_remove(data, ".*_"), ~ n[substr(., 1, 1)] * parse_number(.))

[1] -10  -9  -8  -7  -6

How it works

stringr::str_remove(data, ".*_")

As mentioned in the comments by @thelatemail, .* is a greedy regular expression that matches zero or more (*) of any character (.). Because it is greedy, it matches as much as it can, so this removes everything up to and including the last underscore:

str_remove("A_B_C_D", ".*_")
[1] "D"

This is opposed to the ungreedy (lazy) form .*?, which matches as little as possible and therefore stops at the first underscore:

str_remove("A_B_C_D", ".*?_")
[1] "B_C_D"
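
The same greedy versus lazy contrast can be reproduced without stringr. This is a minimal base R sketch using sub() with perl = TRUE; the variable names are only illustrative.

```r
x <- "A_B_C_D"

# Greedy: ".*" grabs as much as it can, so the match ends at the LAST underscore.
greedy <- sub(".*_", "", x, perl = TRUE)

# Lazy: ".*?" grabs as little as it can, so the match ends at the FIRST underscore.
lazy <- sub(".*?_", "", x, perl = TRUE)

greedy  # "D"
lazy    # "B_C_D"
```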

purrr::map_dbl

This function iterates over a list or atomic vector and outputs an atomic vector of type double, hence the _dbl.

The ~ is purrr's formula-style lambda syntax. In base R this would be written as function(x) n[substr(x, 1, 1)] * parse_number(x), or, from R 4.1.0 onward, with the shorthand \(x) n[substr(x, 1, 1)] * parse_number(x). It's just a more compact way of writing an anonymous function (i.e. a function that has not been assigned to a name). This is a common syntactic style in the tidyverse. Here the x argument of the function is replaced by the dot (.).
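
For comparison, here is a sketch of the same anonymous function written in base R, once with function(x) and once with the \(x) shorthand (which needs R 4.1.0 or later). It uses vapply() instead of map_dbl(), and a small digits() helper stands in for parse_number(); both substitutions are assumptions made to keep the example free of package dependencies.

```r
# Lookup table from the answer: first character -> multiplier.
n <- setNames(c(0, -1, 1), c("0", "p", "f"))
suffixes <- c("p10", "f20", "0")

# Base R stand-in for parse_number(): drop non-digits, then convert.
digits <- function(s) as.numeric(gsub("\\D+", "", s))

# Classic anonymous function.
classic <- vapply(suffixes, function(x) n[[substr(x, 1, 1)]] * digits(x), numeric(1))

# Shorthand lambda, available from R 4.1.0.
shorthand <- vapply(suffixes, \(x) n[[substr(x, 1, 1)]] * digits(x), numeric(1))

unname(classic)  # -10 20 0
```

Both calls give the same result; the formula syntax in map_dbl() is just a third spelling of the same function.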

n[substr(., 1, 1)] * parse_number(.)

substr(., 1, 1) takes the first character of the parsed string:

substr("f20", 1,1 )
[1] "f"

Then it looks up that first character in the named vector n to return the value you specified in the question, which is one in the case when the letter is "f":

n["f"]
f 
1

readr::parse_number extracts the first number it finds in a string, dropping any non-numeric characters around it, and returns it as numeric:

readr::parse_number("f20")
[1] 20

These two values are multiplied and returned as an element in the double atomic vector output.

Note: this works when the suffix is "0" because the result of this operation is 0*0:

substr("0",1 ,1)
[1] "0"

n["0"]
0 
0 

parse_number("0")
[1] 0

n[substr("0", 1, 1)] * parse_number("0")
0 
0

You'll notice this output is technically a named vector with "0" being the name and the value being 0; however, this is coerced to a double vector by map_dbl.
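
The naming behaviour can also be seen in base R, where the lookup works on the whole vector at once. This is a sketch only: as.numeric(gsub(...)) stands in for parse_number(), and the suffixes are made-up values.

```r
n <- setNames(c(0, -1, 1), c("0", "p", "f"))
suffixes <- c("p10", "f20", "0")

# Indexing n by the first characters returns a named vector,
# and the names survive the multiplication.
result <- n[substr(suffixes, 1, 1)] * as.numeric(gsub("\\D+", "", suffixes))

names(result)   # "p" "f" "0"
unname(result)  # -10 20 0
```

Calling unname() drops the names if a plain numeric vector is wanted.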

Why not use the named vector logic in base R too? `n[substr(a, 1, 1)] * as.numeric(gsub("\\D*", "", a))` instead of `ifelse` — thelatemail, May 19 '21 at 03:43
It worked! Could you explain the logic a little bit? Specifically, what do `map_dbl`, the string `.*_`, the symbol `~`, and the multiplication by `parse_number(.)` and its dot do? (I know the dot from the pipe operator and the `~` sign from regression formulas, but nothing else.) Thanks! — Chris, May 20 '21 at 00:54
@Chris Great! Glad it worked. I've updated my answer with some additional explanation. Hope it helps. — LMc, May 20 '21 at 14:51

jpdugo17 · Answer 2 · 2021-05-19T03:12:16.543

I tried with rebus and this is what I get:

library(tidyverse)
library(rebus)
library(stringr)
data <- structure(1:5, .Label = c("AVE_prQD_AFR_p10", "PER_prVD_DSR_p9", "PA_prSX_AR_p8", 
                                      "prAV_AES_p7", "prGR_AXXR_p6", "prQW_AWAR_p5"), class = "factor")
    
# rebus END anchors the pattern at the end of the string
chars <- str_extract(data, rebus::or(ANY_CHAR %R% one_or_more(DGT) %R% END, '0' %R% END))

# Build the conditions.
# map_dbl is also an option, to avoid returning a list.
numbers <- map(chars, ~ {
  if (str_sub(.x, 1, 1) == 'p') {
    as.numeric(str_extract(.x, one_or_more(DGT))) * -1
  } else if (str_sub(.x, 1, 1) == 'f') {
    as.numeric(str_extract(.x, one_or_more(DGT)))
  } else {
    0
  }
})

print(numbers)

HNSKD · Answer 3 · 2021-05-19T04:08:22.380

Data:

data <- data.frame(V1 = c("AVE_prQD_AFR_p10", "PER_prVD_DSR_p9", "PA_prSX_AR_p8", 
                     "prAV_AES_p7", "prGR_AXXR_p6", "prQW_AWAR_p5",
                     "AVE_prQD_AFR_f10", "PER_prVD_DSR_f9", "PA_prSX_AR_f8", 
                     "prAV_AES_f7", "prGR_AXXR_f6", "prQW_AWAR_f5",
                     "AVE_prQD_AFR_0", "PER_prVD_DSR_0", "PA_prSX_AR_0", 
                     "prAV_AES_0", "prGR_AXXR_0", "prQW_AWAR_0"))

Method 1

We can work on it as follows:

Separate the first portion (containing alphanumeric characters and at least 1 "_") and the second portion (containing only alphanumeric characters) into V2 and V3 respectively
Extract only numeric values from V3 to obtain V4
If V3 contains "p", obtain V5 by taking V4*-1. Others remain the same

Code:

library(dplyr)  # provides %>%, mutate() and case_when()

data %>% 
  tidyr::extract(V1, c("V2", "V3"), "([[:alnum:]_]+)_([[:alnum:]]+)$", remove = FALSE) %>% 
  mutate(V4 = readr::parse_number(V3),
         V5 = case_when(stringr::str_detect(V3, "p") ~ V4*-1,
                        TRUE ~ V4))

Method 2

Or perhaps, instead of splitting the string into 2 (as seen in Method 1), we may just extract the portion of interest and work from there.

data %>% 
  mutate(V2 = stringr::str_extract(V1, "[pf]?\\d{1,2}$"),
         V3 = readr::parse_number(V2),
         V4 = case_when(stringr::str_detect(V2, "p") ~ V3*-1,
                        TRUE ~ V3))
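
For readers without the tidyverse, the same extract-then-convert idea can be sketched in base R. This is only an illustration: the three input strings are a subset of the data above, [0-9] replaces \\d, and regmatches() with regexpr() takes the place of str_extract().

```r
V1 <- c("AVE_prQD_AFR_p10", "PER_prVD_DSR_f9", "PA_prSX_AR_0")

# Optional p or f followed by one or two digits at the end of the string.
V2 <- regmatches(V1, regexpr("[pf]?[0-9]{1,2}$", V1))

# Drop the leading letter, if any, and convert to numeric.
V3 <- as.numeric(sub("^[pf]", "", V2))

# Flip the sign only for the "p" suffixes.
V4 <- ifelse(startsWith(V2, "p"), -V3, V3)

data.frame(V1, V2, V3, V4)
```

Note that regmatches() silently drops strings with no match, so this sketch assumes every input ends with a valid suffix.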

score 1 · Answer 4 · answered May 19 '21 at 06:45

Solution that extracts the trailing digits and multiplies them by -1 if they are preceded by a p (numbers after an f stay positive, and 0 has no sign):

library(tidyverse)

data <- structure(1:8, .Label = c("AVE_prQD_AFR_p10", "PER_prVD_DSR_p9", "PA_prSX_AR_p8", 
                              "prAV_AES_p7", "prGR_AXXR_p6", "prQW_AWAR_p5",
                              "prQW_AWAR_0", "prQW_AWAR_f5"), class = "factor")

data %>%
  enframe(name = NULL, value = "V1") %>% # create tibble from vector
  mutate(want = as.numeric(str_extract(V1, "\\d+$")) * if_else(str_detect(V1, "_p\\d+$"), -1, 1))

#   V1                want
#   <fct>            <dbl>
# 1 AVE_prQD_AFR_p10   -10
# 2 PER_prVD_DSR_p9     -9
# 3 PA_prSX_AR_p8       -8
# 4 prAV_AES_p7         -7
# 5 prGR_AXXR_p6        -6
# 6 prQW_AWAR_p5        -5
# 7 prQW_AWAR_0          0
# 8 prQW_AWAR_f5         5

score 1 · Answer 5 · answered May 19 '21 at 07:01

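The simplest solution is just to use a regex. I prefer str_extract from stringr (part of the tidyverse); the pattern [^_]+$ matches the run of non-underscore characters at the end of the string.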

data %>% str_extract("[^_]+$")
#> [1] "p10" "p9"  "p8"  "p7"  "p6"

Created on 2021-05-19 by the reprex package (v1.0.0)

Peter H.