How to extract specific strings along with their corresponding numeric values?

Question

I have the following column 'checks' in my data frame 'B', which holds input statements in different rows. Each statement contains a variable 'abc' followed by a corresponding value. The entries were typed manually and are not formatted consistently. I need to extract just 'abc' followed by its value.

> B$checks

    rows    Checks
    [1] there was no problem  reported measures abc-96 xyz 450 327bbb11869 xyz 113 aaa 4 poc 470 b 3 surveyor issue
    [2] abc(107 to 109) xyz 115 jbo xyz 104 optim
    [3] problemm with caller abc 95 19468 4g xyz 103 91960 1 Remarks new loc reqd is problem
    [4] abc_107 xyz 116 dor problem 
    [5] surevy done , no approximation issues abc 103 xyz 109 crux xyz 104 
    [6] ping test ok abc(86 rxlevel 84
    [7] field is clean , can be used to buiild the required set up abc-86 xyz 94 Digital DSL  No Building class Residential Building Type Multi
    [8] abc 89 xyz 99 so as the user has no problem , check ping test

Expected output

rows    Variable    Value
        [1] abc 96
        [2] abc 107
        [3] abc 95
        [4] abc 107
        [5] abc 103
        [6] abc 86
        [7] abc 86
        [8] abc 89

I tried the following, based on answers to similar questions.

Using str_match:

library(stringr)
m1 <- str_match(B$checks, "abc.*?([0-200.]{1,})")  # value is between 0 to 200

which yielded something like the following:

    row var value
1   abc-96 xyz 450  0
2   abc(10  10
3   abc 95 1    1
4   abc_10  10
5   abc 10  10
6   NA  NA
7   NA  NA
8   NA  NA

Then I tried the following

B$Checks <- gsub("-", " ", B$Checks)
B$Checks <- gsub("/", " ", B$Checks)
B$Checks <- gsub("_", " ", B$Checks)
B$Checks <- gsub(":", " ", B$Checks)
B$Checks <- gsub(")", " ", B$Checks)
B$Checks <- gsub("(", " ", B$Checks, fixed = TRUE)  # fixed = TRUE: a bare "(" is not a valid regex
B$Checks <- gsub(".*abc", "abc", B$Checks) 
B$Checks <- gsub("[[:punct:]]", " ", B$Checks)
regexp <- "[[:digit:]]+"   
m <- str_extract(B$Checks, regexp) 
m <- as.data.frame(m)

With this I was able to get the expected output.

But now I am looking for the following:

1) Simpler set of commands or way to extract the expected output

2) A way to get values that are given as a range. For example, I want the input row below

rows    Checks
[2] abc(107 to 109) xyz 115 jbo xyz 104 optim

as

output >

rows    Variable    Value1 Value2
 [2]     abc        107   109

I need a solution for both 1) and 2), as I am working on larger data sets with the same patterns and a lot of mixed Variable-Value combinations.

Thanks in advance.

Try `sub(".*?(abc)\\D+(\\d+).*", "\\1 \\2", B$Checks)`. Note that `[0-200]` is [a wrong way to match number ranges](https://stackoverflow.com/questions/3148240/why-doesnt-01-12-range-work-as-expected). — Wiktor Stribiżew, May 23 '18 at 11:40
Actually, `[0-200.]{1,}` doesn't check for values between 0 and 200; it matches an unlimited number of characters from the set "0", "1", "2" and ".". — Kaddath, May 23 '18 at 11:42
@rock321987 I actually have lots of rows similar to the pattern listed in point 2). Is there any way I can extract the numeric values by modifying the syntax in the solution given? — smokinjoe, May 24 '18 at 10:03
@smokinjoe I understand the accepted solution worked better, right? If not, I will post my answer with necessary explanations. — Wiktor Stribiżew, May 24 '18 at 14:40
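
The point made in the comments about `[0-200.]` can be checked directly. The sketch below uses two made-up inputs (not from the question) to show that the bracket expression accepts single characters rather than a numeric range, and that one way to enforce 0 to 200 is to extract the digits first and compare them as numbers.

```r
# Two made-up inputs: one value inside 0-200, one just outside it.
x <- c("abc 150", "abc 210")

# "[0-200.]" is a character class: it matches any one of the characters
# "0", "1", "2" or ".", not a number between 0 and 200.
class_hits <- regmatches(x, regexpr("[0-200.]+", x))
print(class_hits)   # "150" is cut at the "5"; "210" is accepted in full

# To enforce a numeric range, extract the digits and compare as numbers.
value <- as.numeric(sub(".*?abc\\D*(\\d+).*", "\\1", x, perl = TRUE))
in_range <- value >= 0 & value <= 200
print(data.frame(value, in_range))
```

Filtering on the numeric value keeps the regex simple and leaves the range limits easy to change.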

Cath · Accepted Answer · 2018-05-23T12:02:34.403

3

You need to capture the digits, using a lookbehind to require that `abc` comes before them:

Value <- sub(".*(?<=abc)(\\D+)?(\\d*)\\D?.*", "\\2", str, perl=TRUE)
# Value
#[1] "96"  "107" "95"  "107" "103" "86"  "86"  "89"

You can then put the values in a data.frame:

B <- data.frame(Variable="abc", Value=as.numeric(Value))
head(B, 3)
#  Variable Value
#1      abc    96
#2      abc   107
#3      abc    95

data

str <- c("there was no problem  reported measures abc-96 xyz 450 327bbb11869 xyz 113 aaa 4 poc 470 b 3 surveyor issue", 
"abc(107 to 109) xyz 115 jio xyz 104 optim", "problemm with caller abc 95 19468 4g xyz 103 91960 1 Remarks new loc reqd is problem", 
"abc_107 xyz 116 dor problem", "surevy done , no approximation issues abc 103 xyz 109 crux xyz 104 ", 
"ping test ok abc(86 rxlevel 84", "field is clean , can be used to buiild the required set up abc-86 xyz 94 Digital DSL  No Building class Residential Building Type Multi", 
"abc 89 xyz 99 so as the user has no problem , check ping test")
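
Point 2) of the question (values given as a range) is not covered above. A minimal sketch of one possible extension follows; it assumes a range is always written as "<number> to <number>" directly after `abc`, and the helper name `extract_abc_range` is hypothetical, not part of the original answer.

```r
# Hypothetical helper, not part of the original answer.
# Assumption: a range is always written as "<number> to <number>" after "abc".
extract_abc_range <- function(x) {
  first  <- "^.*?abc\\D*(\\d+).*$"                  # first number after abc
  second <- "^.*?abc\\D*\\d+\\s*to\\s*(\\d+).*$"    # number after "to", if any

  has_first  <- grepl(first, x, perl = TRUE)
  has_second <- grepl(second, x, perl = TRUE)

  # Start with NA so rows without a match (or without a range) stay missing.
  value1 <- rep(NA_real_, length(x))
  value2 <- rep(NA_real_, length(x))
  value1[has_first]  <- as.numeric(sub(first, "\\1", x[has_first], perl = TRUE))
  value2[has_second] <- as.numeric(sub(second, "\\1", x[has_second], perl = TRUE))

  data.frame(Variable = "abc", Value1 = value1, Value2 = value2)
}

# Three rows shortened from the question's data.
checks <- c("measures abc-96 xyz 450",
            "abc(107 to 109) xyz 115 jio xyz 104 optim",
            "ping test ok abc(86 rxlevel 84")
result <- extract_abc_range(checks)
print(result)
```

Rows without a range get `NA` in `Value2`, so single values and ranges can live in the same data frame.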

edited May 23 '18 at 12:02

answered May 23 '18 at 11:40


@rock321987 you mean change `?` into `*`? I don't see the benefit. Else, using your pattern won't work so I don't really see what you mean actually – Cath May 23 '18 at 11:50
It entirely depends upon input of OP.. it _might_ be possible to have a string like this `abc--96`.. FYI, your regex will fail for input `abc-96` – rock321987 May 23 '18 at 11:54
@rock321987 indeed I based the regex on OP example, I'll modify it to take into account the possible lack of other characters – Cath May 23 '18 at 12:01
@rock321987 the patterns you put in your first comment will work in `sub` for `abc--96` but fail for all OP's examples (and a lot of others) – Cath May 23 '18 at 12:08
my regex is working as expected. check [here](https://regex101.com/r/fMl9TZ/1) on OPs test cases.. `(\\D+)?` is same as `(\\D*)` and `\\D?.*` isn't needed.. though both of these won't have any impact – rock321987 May 23 '18 at 12:23
@rock321987 your pattern may be correct "in general" but it does not work with capturing with `sub` in `R`: try `sub("abc\\D*(\\d+)", "\\1", "abc(107 to 109) xyz")`, it gives `"107 to 109) xyz"` – Cath May 23 '18 at 12:28
@rock321987 well having the pattern is good, getting the desired output is better... ;-) – Cath May 23 '18 at 12:32

s_baldur · Answer 2 · 2018-05-23T12:33:16.510

Using gsub() twice, with magrittr for better readability:

library(magrittr)

data.frame(
  Variable = "abc",
  Value = data %>%
    gsub(".*(abc.{6}).*", "\\1", .) %>%
    gsub("[^0-9]+(\\d+).*", "\\1", .)
)
  Variable Value
1      abc    96
2      abc   107
3      abc    95
4      abc   107
5      abc   103
6      abc    86
7      abc    86
8      abc    89

First we extract abc and the 6 characters that follow it, and then extract the first integer that appears.
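
The two steps can be inspected separately. The sketch below runs them without the pipe on one row from the data and on one made-up row (`"x 12 abc 89"`, not from the question) that has fewer than 6 characters after abc, to illustrate a caveat of the fixed-width window.

```r
# First element is from the question's data; the second is a made-up
# edge case with fewer than 6 characters after "abc".
x <- c("ping test ok abc(86 rxlevel 84", "x 12 abc 89")

# Step 1: keep "abc" plus the 6 characters that follow it.
# If fewer than 6 characters follow, nothing matches and the string is unchanged.
step1 <- gsub(".*(abc.{6}).*", "\\1", x)
print(step1)

# Step 2: skip leading non-digits and keep the first run of digits.
step2 <- gsub("[^0-9]+(\\d+).*", "\\1", step1)
print(step2)
```

For the made-up row, step 1 leaves the string untouched, so step 2 picks up the 12 that precedes abc rather than the 89 that follows it. On data like the question's, where abc is always followed by at least 6 characters, this does not arise.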

data:

data <- c("there was no problem  reported measures abc-96 xyz 450 327bbb11869 xyz 113 aaa 4 poc 470 b 3 surveyor issue", 
"abc(107 to 109) xyz 115 jio xyz 104 optim", "problemm with caller abc 95 19468 4g xyz 103 91960 1 Remarks new loc reqd is problem", 
"abc_107 xyz 116 dor problem ", "surevy done , no approximation issues abc 103 xyz 109 crux xyz 104 ", 
"ping test ok abc(86 rxlevel 84", "field is clean , can be used to buiild the required set up abc-86 xyz 94 Digital DSL  No Building class Residential Building Type Multi", 
"abc 89 xyz 99 so as the user has no problem , check ping test"
)

score 0 · Answer 3 · answered May 24 '18 at 03:29

Using stringr for manipulating strings and rebus to write readable regex:

library(stringr)
library(rebus)
str_match(checks, pattern = capture("abc") %R% optional(or1(c(SPC, PUNCT))) %R% capture(one_or_more(DGT)))

output:

     [,1]      [,2]  [,3] 
[1,] "abc-96"  "abc" "96" 
[2,] "abc(107" "abc" "107"
[3,] "abc 95"  "abc" "95" 
[4,] "abc_107" "abc" "107"
[5,] "abc 103" "abc" "103"
[6,] "abc(86"  "abc" "86" 
[7,] "abc-86"  "abc" "86" 
[8,] "abc 89"  "abc" "89"
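
To get from a match matrix like this to the Variable/Value table the question asks for, the capture columns can be bound into a data frame. The sketch below does this in base R with `regexec()`; the POSIX pattern is only an assumed rough equivalent of the rebus expression above, and the third input is a made-up row with no match.

```r
# Two rows shortened from the question's data, plus a made-up non-matching row.
checks <- c("measures abc-96 xyz 450", "abc(107 to 109) xyz 115", "no reading here")

# Assumed rough equivalent of the rebus pattern: "abc", an optional
# space or punctuation character, then one or more digits.
pattern <- "(abc)[[:space:][:punct:]]?([[:digit:]]+)"
m <- regmatches(checks, regexec(pattern, checks))

# Each hit is c(full match, "abc", digits); pad non-matches with NA.
rows <- lapply(m, function(hit) {
  if (length(hit) == 3) hit else rep(NA_character_, 3)
})
mat <- do.call(rbind, rows)

result <- data.frame(Variable = mat[, 2],
                     Value = as.numeric(mat[, 3]),
                     stringsAsFactors = FALSE)
print(result)
```

Padding with `NA` keeps the output aligned row by row with the input, which matters when the result is bound back onto the original data frame.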

data:

checks <- c("there was no problem  reported measures abc-96 xyz 450 327bbb11869 xyz 113 aaa 4 poc 470 b 3 surveyor issue", 
            "abc(107 to 109) xyz 115 jio xyz 104 optim", "problemm with caller abc 95 19468 4g xyz 103 91960 1 Remarks new loc reqd is problem", 
            "abc_107 xyz 116 dor problem", "surevy done , no approximation issues abc 103 xyz 109 crux xyz 104 ", 
            "ping test ok abc(86 rxlevel 84", "field is clean , can be used to buiild the required set up abc-86 xyz 94 Digital DSL  No Building class Residential Building Type Multi", 
            "abc 89 xyz 99 so as the user has no problem , check ping test")
