Extract last substring between square brackets

Question

I have a column of data from which I need to extract an alphanumeric string/factor. Example:

Column x
[ghjg6] [fdg5] [113gi4lki] great work 
[xzswedc: acf] [xzt8] [111eerrh5] 
[asd2] [1] [113vu17hg 115er5lgr 112cgnmbh ] get out

I want to get the data in the square brackets [113gi4lki], [111eerrh5] and [113vu17hg 115er5lgr 112cgnmbh] in a separate column. Please advise.

What have you tried? Please show the code you have for this specific problem and explain where you're stuck. – Marius May 23 '17 at 06:37
I tried `library(stringr)` and `str_extract(str1, "\\[(\\w+\\s+){2,}\\w+\\]")`. My initial code opens a link and reads an HTML table. From there I took a column which has the data shown above, and I want to extract the bracketed strings listed above for further calculations. – kishore May 23 '17 at 06:42
https://stackoverflow.com/questions/4736/learning-regular-expressions – jogo May 23 '17 at 07:03
https://stackoverflow.com/questions/44093818/how-to-remove-a-subset-from-a-column-within-square-brackets-in-r – jogo May 23 '17 at 09:10

score 2 · Accepted Answer · answered May 23 '17 at 06:52

You can do:

Column.x <- c(
"[ghjg6] [fdg5] [113gi4lki] great work",
"[xzswedc: acf] [xzt8] [111eerrh5]",
"[asd2] [1] [113vu17hg 115er5lgr 112cgnmbh ] get out")
y <- gsub(".*\\[", "[", Column.x)
gsub("\\].*", "]", y)

result:

> gsub("\\].*", "]", y)
[1] "[113gi4lki]"                      "[111eerrh5]"                      "[113vu17hg 115er5lgr 112cgnmbh ]"

If you want you can put both steps together:

gsub("\\].*", "]", gsub(".*\\[", "[", Column.x))
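To run this on a data frame column, the same two calls can be applied to the column directly. A minimal sketch, assuming a data frame `df` with a character column `Subject` (the data frame, the extra plain-text row and the `last_bracket` column name are made up for illustration). The `grepl()` guard is an addition, not part of the answer above: it puts `NA` in rows that contain no bracket pair, where the two `gsub()` calls would otherwise return the whole string unchanged.

```r
# Hypothetical data frame; the last row has no brackets at all
df <- data.frame(
  Subject = c(
    "[ghjg6] [fdg5] [113gi4lki] great work",
    "[xzswedc: acf] [xzt8] [111eerrh5]",
    "[asd2] [1] [113vu17hg 115er5lgr 112cgnmbh ] get out",
    "plain text without brackets"),
  stringsAsFactors = FALSE)

# TRUE for rows that contain at least one [...] pair
has_brackets <- grepl("\\[[^][]*\\]", df$Subject)

# Same two steps as above: drop everything before the last "[",
# then everything after the first "]" that follows it
last <- gsub("\\].*", "]", gsub(".*\\[", "[", df$Subject))

# Keep the extracted text only where a bracket pair exists
df$last_bracket <- ifelse(has_brackets, last, NA_character_)
df$last_bracket
```

The guard only covers rows with no bracket pair at all; a row whose last `[` is never closed would still come back partly unprocessed.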

answered May 23 '17 at 06:52

jogo


Thanks, this works great. Can I apply this directly to a data frame of length 25, and how? – kishore May 23 '17 at 07:39
I got it working on the column as well, thanks. It is just one or two rows where the wrong string is pulled. – kishore May 23 '17 at 07:45

score 2 · Answer 2 · answered Jun 12 '17 at 19:41

To get the text inside the last set of [...] brackets, you may use a sub with the following pattern:

".*\\[([^][]+)].*"

The pattern matches:

.* - any 0+ chars greedily, as many as possible, up to the last occurrence of the subsequent subpatterns
\\[ - a literal [ (must be escaped outside of the bracket expression)
([^][]+) - Group 1 (later referred to with \1) matching 1 or more chars other than ] and [
] - a literal ] (no need to escape it outside of a bracket expression)
.* - the rest of the string.

R online demo:

x <- c("[ghjg6] [fdg5] [113gi4lki] great work", "[xzswedc: acf] [xzt8] [111eerrh5]", "[asd2] [1] [113vu17hg 115er5lgr 112cgnmbh ] get out", "Some text with no brackets")
df <- data.frame(x)
df$x = sub(".*\\[([^][]+)].*", "\\1", df$x)
df

Output:

                               x
1                      113gi4lki
2                      111eerrh5
3 113vu17hg 115er5lgr 112cgnmbh 
4     Some text with no brackets

If you want entries with no [...] (like the last one in my test set) to become empty strings instead of being left unchanged, use

df$x = sub(".*\\[([^][]+)].*|.*", "\\1", df$x)

See another online R demo.
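If `NA` is preferred over an empty string for rows without brackets, one alternative is to pull Group 1 out with base R's `regexec()` and `regmatches()` instead of `sub()`. This is only a sketch of that idea, not part of the answer above; the variable names `m` and `inner` are made up, and `trimws()` is added to drop the trailing space seen in the third row.

```r
x <- c("[ghjg6] [fdg5] [113gi4lki] great work",
       "[xzswedc: acf] [xzt8] [111eerrh5]",
       "[asd2] [1] [113vu17hg 115er5lgr 112cgnmbh ] get out",
       "Some text with no brackets")

# regexec() reports the full match plus each capture group;
# the greedy .* pushes the match up to the last [...] pair
m <- regmatches(x, regexec(".*\\[([^][]+)]", x))

# Element 2 of each result is Group 1; strings with no match give character(0)
inner <- vapply(
  m,
  function(g) if (length(g) >= 2) g[2] else NA_character_,
  character(1))

# Strip leading/trailing whitespace inside the brackets
inner <- trimws(inner)
inner
```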

answered Jun 12 '17 at 19:41

Wiktor Stribiżew


Stribiżew, thank you for the above script. I came up with something else, `str_extract(df$Subject,"[:punct:]+([:digit:]+[:alpha:]+)+[:punct:]")-> df$x2`, which provides 90% of the correct data; your line of code gives me the remaining 10%. If I use both, that is df$x2 as column 1 and the output from your script as column 2, I get some garbage data in column 2. I have two things to do now: 1. use both and merge them, which should discard the garbage data and give me 100% of the needed data; 2. try to find similar syntax for your code in my script format. – kishore Jun 13 '17 at 07:19
The `str_extract(df$Subject,"[:punct:]+([:digit:]+[:alpha:]+)+[:punct:]")` is wrong. You need to use POSIX character classes inside bracket expressions. Also, I guess by using `[[:digit:]]+[[:alpha:]]+` you want to match alphanumeric? Then use `[[:alnum:]]` and turn the `[[:punct:]]` parts into lookarounds (or, if you just have `[...]`, use the `\\[` and `]`). Try `str_extract(df$Subject,"(?<=\\[)[[:alnum:]]+(?=])")` - but it won't return the *last* occurrences of [texts]. – Wiktor Stribiżew Jun 13 '17 at 07:26
@kishore: Please explain the rules again. Revise your question and edit accordingly as right now, it is not clear any longer what the requirements are. – Wiktor Stribiżew Jun 13 '17 at 07:28
@ Wiktor Stribiżew, Can we chat please, I would be able to explain clearly? – kishore Jun 13 '17 at 07:47
@kishore Try to add more messages here. – Wiktor Stribiżew Jun 13 '17 at 07:53
@kishore: Maybe `"(?<=\\[)[^]\\[]+(?=])(?!.*\\[[^]\\[]*])"` will work best for you with `str_extract`? – Wiktor Stribiżew Jun 13 '17 at 08:03
Or use `library(stringi)` and then `stri_extract_last(df$x,regex="(?<=\\[)[^]\\[]+(?=])") -> df$x3` – Wiktor Stribiżew Jun 13 '17 at 08:07
@Wiktor Stribiżew, OK, so I need to extract anything which starts with "11" and has a length of 9, e.g. 113gi4lki. Just now I tried `str_extract(df$Subject,"([:digit:]+[:alpha:]+[:alnum:]+)")-> df$x1` and got 98% accuracy, but I'm missing the cases where there are several such strings, like [113gi4lki 114gi4lki 115gi4lki]. This type of data is pulled by your query; however, some garbage values are also pulled, like [ghjg6], [xzswedc: acf], [1] (just as examples). I want only the 113gi4lki 114gi4lki 115gi4lki kind of data, which might be separated by '/ ', like 113gi4lki / 114gi4lki /115gi4lki – kishore Jun 13 '17 at 08:08
Yes, I see. So the position in the string may be *any*? – Wiktor Stribiżew Jun 13 '17 at 08:11
Yes, the string can be anywhere in the subject and in any format, separated by space, comma or /, and occasionally (a little rare) not inside [ ]. It is sure to start with 113, 115, 116, or 11 to be more precise, and to have length 9. – kishore Jun 13 '17 at 08:16
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/146502/discussion-between-wiktor-stribizew-and-kishore). – Wiktor Stribiżew Jun 13 '17 at 08:24
You need [`(?<=\[)[^\w\]\[]*11[[:alnum:]]{7}(?:[^\w\]\[]+11[[:alnum:]]{7})*[^\w\]\[]*(?=])`](https://regex101.com/r/YNnzBN/1). If `[` and `]` can be missing, just remove the lookarounds: [`\b11[[:alnum:]]{7}(?:[^\w\]\[]+11[[:alnum:]]{7})*\b`](https://regex101.com/r/YNnzBN/3). Double the backslashes in R. – Wiktor Stribiżew Jun 13 '17 at 08:30
@kishore: (Note that SO inserts rubbish into the comment text, and the first pattern above won't work if you copy paste it from the comment above, copy from the regex101.com!) – Wiktor Stribiżew Jun 13 '17 at 08:38
@kishore: Please let me know if it works, and I will update the answer then. – Wiktor Stribiżew Jun 13 '17 at 08:47
`stri_extract_last(df$Subject,regex="(?<=\[)[^\w\]\[]*11[[:alnum:]]{7}(?:[^\w\]\[]+11[[:alnum:]]{7})*[^\w\]\[]*(?=])") -> df$x3` gives the error `Error: '\[' is an unrecognized escape in character string starting ""(?<=\["`, and `stri_extract_last(df$Subject,regex="\b11[[:alnum:]]{7}(?:[^\w\]\[]+11[[:alnum:]]{7})*\b") -> df$x3` gives the error `Error: '\w' is an unrecognized escape in character string starting ""\b11[[:alnum:]]{7}(?:[^\w"` – kishore Jun 13 '17 at 09:10
@kishore: I told you to **double the backslashes in R**. – Wiktor Stribiżew Jun 13 '17 at 09:11
`regex="\\b11[[:alnum:]]{7}(?:[^\\w\\]\\[]+11[[:alnum:]]{7})*\\b"` - to define a backslash in an R string literal, you need to use 2 backslashes. – Wiktor Stribiżew Jun 13 '17 at 09:14
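
For the requirement that emerged in the comments (codes that start with "11" and are 9 characters long, anywhere in the string), a simpler base R variant can be tried. This is a sketch only and not the exact pattern from the comments: it matches each code separately with `\\b11[[:alnum:]]{7}\\b` instead of matching a whole run of codes at once, and the sample vector `subject` and the " / " separator are assumptions for illustration. Note the doubled backslashes in the R string literal.

```r
subject <- c(
  "[ghjg6] [fdg5] [113gi4lki] great work",
  "[asd2] [1] [113vu17hg 115er5lgr 112cgnmbh ] get out",
  "113gi4lki / 114gi4lki /115gi4lki",
  "no codes here")

# "11" followed by exactly 7 alphanumerics, delimited by word boundaries
pattern <- "\\b11[[:alnum:]]{7}\\b"

# gregexpr() finds every match per string; regmatches() extracts them as a list
hits <- regmatches(subject, gregexpr(pattern, subject, perl = TRUE))

# Collapse the codes found in each string; strings with no code give ""
codes <- vapply(hits, paste, character(1), collapse = " / ")
codes
```

Because each code is matched on its own, this works whether or not the codes sit inside [ ] and regardless of the separator between them.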
