Regex and stringr newbie here. I have a data frame with a column from which I want to find 10-digit numbers and keep only the first three digits. Otherwise, I want to just keep whatever is there.

So to make it easy let's just pretend it's a simple vector like this:

new<-c("111", "1234567891", "12", "12345")

I want to write code that will return a vector with elements: 111, 123, 12, and 12345. I also need to write code (I'm assuming I'll do this iteratively) where I extract the first two digits of a 5-digit string, like the last element above.

I've tried:

gsub("\\d{10}", "", new)

but I don't know what I could put for the replacement argument to get what I'm looking for. Also tried:

str_replace(new, "\\d{10}", "")

But again I don't know what to put in for the replacement argument to get just the first x digits.

Edit: I disagree that this is a duplicate question because it's not just that I want to extract the first X digits from a string but that I need to do that with specific strings that match a pattern (e.g., 10 digit strings.)

panpsych77
  • Try `sub("^(\\d{3})\\d{7}$", "\\1", new)`, see [this demo](https://ideone.com/hrZKVL) – Wiktor Stribiżew Jun 08 '19 at 16:30
  • Please don't use variable names like `new`. – NelsonGon Jun 08 '19 at 16:38
  • @NelsonGon, I don't usually but thanks! It was just for the example. – panpsych77 Jun 08 '19 at 18:06
  • @WiktorStribiżew, thank you. Could you expand this to an answer below and explain how the code works? I will then accept it as the answer. I can see that it works but want to be sure I understand how, especially since I may need to reuse the code for other purposes. – panpsych77 Jun 08 '19 at 19:14
  • @panpsych77 [Posted](https://stackoverflow.com/a/56509435/3832970) with explanations and a demo. – Wiktor Stribiżew Jun 08 '19 at 19:21
  • The question was erroneously closed as a duplicate of [this](https://stackoverflow.com/questions/51402052/extract-first-n-digits-from-a-string) and [this](https://stackoverflow.com/questions/38750535/extract-the-first-2-characters-in-a-string) question. The problem here is not extracting the first digits or chars, but getting a specific digit chunk at the start of the string **if** the rest of the string meets a specific pattern. The fact that the other answers here do not consider that does not make this question a dupe of the mentioned questions. – Wiktor Stribiżew Jun 20 '19 at 10:43

3 Answers

If you are willing to use the stringr package, which is where the str_replace you are using comes from, you can simply use str_extract:

library(stringr)
vec <- c("111", "1234567891", "12")  # character strings, as in the question
str_extract(vec, "^\\d{1,3}")

The regex ^\\d{1,3} matches between one and three digits at the very start of the string. str_extract, as the name implies, extracts and returns these matches.

bpbutti
  • Thanks. The only problem is that there are some strings where I need to do something different. That is, there are strings that have 5 digits where I need to extract only the first two. So I was looking for a way to match a particular pattern and then replace rather than generally pull up to 3 of the first digits from all strings in the vector. I hope that clarifies. I edited my question to reflect this. – panpsych77 Jun 08 '19 at 16:52

You can use:

my_vec <- c("111", "1234567891", "12")
as.numeric(substring(my_vec, 1, 3))
#[1] 111 123  12
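
Note that substring() on its own shortens every element to at most three characters. If only the 10-digit strings should be shortened, as the question asks, one possible sketch is to test each element with grepl() and apply substring() only where the test succeeds. The names is_ten_digits and result are illustrative, and the fourth element is taken from the question's example:

```r
my_vec <- c("111", "1234567891", "12", "12345")

# TRUE only for elements made up of exactly ten digits
is_ten_digits <- grepl("^\\d{10}$", my_vec)

# Shorten the matching elements; leave everything else untouched
result <- ifelse(is_ten_digits, substring(my_vec, 1, 3), my_vec)
print(result)
```

This keeps the values as character strings, so nothing is coerced to numeric along the way.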
NelsonGon

You may use

new<-c("111", "1234567891", "12")
sub("^(\\d{3})\\d{7}$", "\\1", new)
## => [1] "111" "123" "12" 

See the R online demo and the regex demo.

Details

  • ^ - start of string anchor
  • (\d{3}) - Capturing group 1 (this value is accessed using \1 in the replacement pattern): three digit chars
  • \d{7} - seven digit chars
  • $ - end of string anchor.

So, the sub command only matches strings that are composed only of 10 digits, captures the first three into a separate group, and then replaces the whole string (as it is the whole match) with the three digits captured in Group 1.
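
Since the same pattern can be reused for other lengths, here is a minimal sketch that builds the regex from two numbers. The helper keep_prefix and its arguments total and keep are hypothetical names introduced for illustration; they are not part of base R or stringr.

```r
# Hypothetical helper: if an element consists of exactly `total` digits,
# replace it with its first `keep` digits; otherwise return it unchanged.
keep_prefix <- function(x, total, keep) {
  stopifnot(keep >= 1, keep < total)
  # e.g. total = 10, keep = 3 builds the regex ^(\d{3})\d{7}$
  pattern <- sprintf("^(\\d{%d})\\d{%d}$", keep, total - keep)
  sub(pattern, "\\1", x)
}

ids <- c("111", "1234567891", "12", "12345")
ids <- keep_prefix(ids, total = 10, keep = 3)  # 10-digit strings -> first 3 digits
ids <- keep_prefix(ids, total = 5, keep = 2)   # 5-digit strings  -> first 2 digits
print(ids)
```

The passes run one after another, so a value shortened by an earlier pass could be matched again by a later one if its new length happens to equal that pass's total.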

Wiktor Stribiżew
  • I want to note that this also addresses the issue I mentioned in my edit that I would like to use the code iteratively in order to pick out strings of other lengths (e.g., 4 digits) and replace with first 2 digits, etc., for example by changing it to: sub("^(\\d{2})\\d{2}$", "\\1", new). Thanks for a very helpful and detailed answer (I don't have enough reputation points to upvote it). – panpsych77 Jun 08 '19 at 19:28