
I have a string value, say, 102-105+106-10605-10605 -10610-10610+10613. How can I easily extract all three-digit values and all five-digit values? An additional task is to take into account the + or - sign before each value, for example, to extract all five-digit values that are preceded by a - sign.

I know that there are some packages in R that enable you to do that. But I don't know how to do that exactly. I've tried various code, but unfortunately I failed each time.

From the string above, I would like to extract the values that have exactly three digits and, separately, those that have exactly five digits.

I used the code

str_extract_all(d, ("\\d{3}"))

And it gives me

[1] "102" "105" "106" "106" "106" "106" "106" "106" "106" "106". 

But I want the following result: "102" "105" "106", i.e. the code should ignore the 5-digit values rather than extract three consecutive digits from them.

For the 5-digit query, str_extract_all(d, ("\\d{5}")) gives me

[1] "10605" "10605" "10610" "10610" "10613" "10613" "10620". 

This result is correct.

Dharman

2 Answers

library(stringr)

vect <- "102-105+106-10605-10605 -10610-10610+10613"

#Extract 3 digits
str_extract_all(vect, pattern = "[:digit:]{3}")
[[1]]
[1] "102" "105" "106" "106" "106" "106" "106" "106"

#Extract 5 digits    
str_extract_all(vect, pattern = "[:digit:]{5}")
[[1]]
[1] "10605" "10605" "10610" "10610" "10613"

#Extract 5 digits with minus sign ahead of it
str_extract_all(vect, pattern = "-[:digit:]{5}")
[[1]]
[1] "-10605" "-10605" "-10610" "-10610"

Hope this helps
For reference: https://stringr.tidyverse.org/articles/regular-expressions.html

Edit: based on your comment

vect2 <- str_split(vect, pattern = "[^[:alnum:]]")
vect2
[[1]]
[1] "102"   "105"   "106"   "10605" "10605" ""      "10610" "10610" "10613"

unlist(str_extract_all(unlist(vect2), pattern = "^[:digit:]{3}$"))
[1] "102" "105" "106"
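If splitting first is not desired, a possible alternative is to keep a single pattern and add lookarounds, so that a run of digits only matches when no further digit sits directly before or after it. This is a minimal sketch, assuming stringr is installed; the names d, three and five are illustrative and not part of the answer above.

```r
library(stringr)

d <- "102-105+106-10605-10605 -10610-10610+10613"

# (?<!\\d) : the character before the match must not be a digit
# (?!\\d)  : the character after the match must not be a digit
three <- str_extract_all(d, "(?<!\\d)\\d{3}(?!\\d)")[[1]]
five  <- str_extract_all(d, "(?<!\\d)\\d{5}(?!\\d)")[[1]]

print(three)
print(five)
```

With the sample string, three should hold "102" "105" "106" and five should hold the five 5-digit values, since the lookarounds reject any three-digit run that is part of a longer number.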
Jamalan
  • Thank you. But I need to skip a value if it has more than 3 digits (in the three-digit case), and skip all values that have fewer or more than 5 digits (in the five-digit case). I.e. as the result of the three-digit query I would like to get '102' '105' '106', and in the five-digit case '10605' '10605' '10610' etc. Thank you – David Bijoyan Nov 19 '19 at 17:53
  • I think it has to be split as in the edit above, otherwise it'll always consider either the first three or last three digits of the 5-digit numbers – Jamalan Nov 19 '19 at 19:04
  • Unfortunately it gives a slightly different result from what I want. That's because the example I provided is not perfect. I'll try one more time. Suppose we have the following string: `102-105+106-10705-10805 -10910-11010+11113`. The 3-digit query should return only `102, 105, 106`. But your code returns `107` as well when run on this string, which is not what I want. I hope I have described it well. – David Bijoyan Nov 19 '19 at 19:11
  • It doesn't provide the 107. `vect3 <- "102-105+106-10705-10805 -10910-11010+11113"`, then `vect4 <- str_split(vect3, pattern = "[^[:alnum:]]")`, then `unlist(str_extract_all(unlist(vect4), pattern = "^[:digit:]{3}$"))` gives `[1] "102" "105" "106"` – Jamalan Nov 19 '19 at 19:21

You can do it like this:

library(stringr)
d<-"102-105+106-10605-10605 -10610-10610+10613"

str_match_all(d, "\\b([\\+\\-]*\\d{3})\\b")[[1]][,2]
[1] "102"  "-105" "+106"

str_match_all(d, "\\b([\\+\\-]*\\d{5})\\b")[[1]][,2]
[1] "-10605" "-10605" "10610"  "-10610" "+10613"

Delete the [\\+\\-]* if you don't want to capture the leading +/-.

The \\b is regex for a "word boundary" - the start or end of a word (or, in this case, number).
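One thing to watch for with \\b: in the sample string the third five-digit value is preceded by a space and a minus (" -10610"). Neither character is a word character, so the first word boundary falls after the sign, which is why that value appears as "10610" in the output above. A possible variant, sketched here under the assumption that a sign is always a single + or - directly before the digits, replaces \\b with lookarounds. The names signed5 and neg5 are illustrative.

```r
library(stringr)

d <- "102-105+106-10605-10605 -10610-10610+10613"

# Optional single sign, then exactly five digits.
# (?<!\\d) sits after the sign: the digits must not continue a longer number.
# (?!\\d) stops the match from being the start of a longer number.
signed5 <- str_extract_all(d, "[+\\-]?(?<!\\d)\\d{5}(?!\\d)")[[1]]

# Keep only the values with a leading minus, then drop the sign.
neg5 <- sub("^-", "", signed5[startsWith(signed5, "-")])

print(signed5)
print(neg5)
```

Here signed5 keeps the sign on every five-digit value, and neg5 answers the "five digits with a - before" part of the question. Changing {5} to {3} gives the three-digit case.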

Andrew Gustar