I know there are a few similar questions, but they did not help me, perhaps because I lack a grasp of the basics of string manipulation.

I have a string and want to extract the contents of its first pair of square brackets.

x <- "cons/mod2/det[4]/rost2/rost_act[2]/Q2w5"

I pieced together the following code from various sources online, but it returns the contents of the second pair of brackets:

sub(".*\\[(.*)\\].*", "\\1", x, perl=TRUE)

The code returns 2. I expect to get 4.

I would appreciate it if someone could point out the missing piece.

---- update ----

Replacing .* with .*? in the first two instances worked, but I do not know why. I am leaving the question open for anyone who can explain why this works:

sub(".*?\\[(.*?)\\].*", "\\1", x, perl=TRUE)
Masood Sadat
  • 1
    You can subset the 1st value after using the accepted answer from here ? https://stackoverflow.com/questions/2403122/regular-expression-to-extract-text-between-square-brackets – Ronak Shah Aug 29 '18 at 02:55
  • 1
    `sub(".*\\[(.*?)\\].*", "\\1", x) ` seems to work as per link @RonakShah suggests. – zacdav Aug 29 '18 at 02:57
  • Thanks @zacdav, that partially helped. I changed the second `.*` to `.*?` which worked, but don't know how. Thanks Ronak for reference but I couldn't get help there – Masood Sadat Aug 29 '18 at 03:15
  • @msd `\[(\d)(?:.*)` this matches the first '4' – The Scientific Method Aug 29 '18 at 04:16
  • `.*?` works because it is a lazy `*`. That is, it matches 0 to N times and prefers the shortest possible match (as opposed to the regular `*`, which prefers the longest possible match) – Julio Aug 29 '18 at 06:42

2 Answers

1

You're almost there:

sub("^[^\\]]*\\[(\\d+)\\].*", "\\1", x, perl=TRUE)
## [1] "4"

The original problem is that the leading .* is greedy: it matches as much as possible, so the \\[ that follows it ends up matching the last [ in the string. Your fix, *?, is the lazy (non-greedy, reluctant) version of *, which matches as little as it can.

That is completely valid. The alternative I used is [^\\]]*, which means: match any run of characters that are not ].
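
To make the difference concrete, here is a minimal base R sketch that runs the three approaches side by side on the string from the question. The variable names (greedy, lazy, negated) are only illustrative, and the third pattern uses a negated class on [ rather than ], which is a small variation on the pattern above, not the exact one from this answer.

```r
x <- "cons/mod2/det[4]/rost2/rost_act[2]/Q2w5"

# Greedy: the leading .* consumes the whole string, then backtracks
# only as far as needed, so \\[ lands on the LAST "[".
greedy <- sub(".*\\[(.*)\\].*", "\\1", x, perl = TRUE)

# Lazy: .*? grows one character at a time, so \\[ lands on the FIRST "[",
# and (.*?) stops at the first "]" after it.
lazy <- sub(".*?\\[(.*?)\\].*", "\\1", x, perl = TRUE)

# Negated class: [^\\[]* cannot cross a "[", so it must stop at the first one;
# [^\\]]* likewise captures everything up to the first "]".
negated <- sub("^[^\\[]*\\[([^\\]]*)\\].*", "\\1", x, perl = TRUE)

print(c(greedy = greedy, lazy = lazy, negated = negated))
```

The negated-class version does not depend on lazy quantifiers at all, which can make the intent easier to read: the character class itself states what the match is not allowed to run past.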

s_baldur
0

stringr

You can solve this with base R, but I usually prefer the functions from the stringr package when handling such problems.

x <- "cons/mod2/det[4]/rost2/rost_act[2]/Q2w5"

If you want only the first string between brackets, use str_extract:

stringr::str_extract(x, "(?<=\\[).+?(?=\\])")
# [1] "4"

If you want all the strings between brackets, use str_extract_all:

stringr::str_extract_all(x, "(?<=\\[).+?(?=\\])")
# [[1]]
# [1] "4" "2" 
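
For comparison, the same lookaround pattern can be used without stringr. The following is a rough base R sketch, assuming regmatches() with regexpr() for the first match and gregexpr() for all matches; perl = TRUE is needed because the default regex engine does not support lookbehind. The names pattern, first and all_matches are just placeholders.

```r
x <- "cons/mod2/det[4]/rost2/rost_act[2]/Q2w5"

# Same pattern as above: text preceded by "[" and followed by "]".
pattern <- "(?<=\\[).+?(?=\\])"

# regexpr() locates only the first match; regmatches() extracts it.
first <- regmatches(x, regexpr(pattern, x, perl = TRUE))

# gregexpr() locates every match; the result is a list with one element per input string.
all_matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

print(first)
print(all_matches)
```

One difference to keep in mind: when there is no match, regmatches() with regexpr() returns a zero-length character vector, whereas str_extract() returns NA.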
Wimpel