I have a column of data from which I need to extract an alphanumeric string/factor. Example:

Column x
[ghjg6] [fdg5] [113gi4lki] great work 
[xzswedc: acf] [xzt8] [111eerrh5] 
[asd2] [1] [113vu17hg 115er5lgr 112cgnmbh ] get out

I want to get the data in the square brackets [113gi4lki], [111eerrh5] and [113vu17hg 115er5lgr 112cgnmbh] in a separate column. Please advise.

Wiktor Stribiżew
kishore
  • What have you tried? Please show the code you have for this specific problem and explain where you're stuck – Marius May 23 '17 at 06:37
  • I tried `library(stringr)` and `str_extract(str1, "\\[(\\w+\\s+){2,}\\w+\\]")`. My initial code opens a link and reads an HTML table. From there I took a column that holds data like the above, and I want to extract the strings shown above for further calculations. – kishore May 23 '17 at 06:42
  • https://stackoverflow.com/questions/4736/learning-regular-expressions – jogo May 23 '17 at 07:03
  • https://stackoverflow.com/questions/44093818/how-to-remove-a-subset-from-a-column-within-square-brackets-in-r – jogo May 23 '17 at 09:10

2 Answers

You can do:

Column.x <- c(
"[ghjg6] [fdg5] [113gi4lki] great work",
"[xzswedc: acf] [xzt8] [111eerrh5]",
"[asd2] [1] [113vu17hg 115er5lgr 112cgnmbh ] get out")
y <- gsub(".*\\[", "[", Column.x)
gsub("\\].*", "]", y)

result:

> gsub("\\].*", "]", y)
[1] "[113gi4lki]"                      "[111eerrh5]"                      "[113vu17hg 115er5lgr 112cgnmbh ]"

If you want you can put both steps together:

gsub("\\].*", "]", gsub(".*\\[", "[", Column.x))
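
A follow-up comment asks how to do this on a data frame directly. A minimal sketch, assuming the strings live in a character column; the data frame `df` and the column names `Subject` and `last_bracket` are hypothetical, not from the question:

```r
# Hypothetical data frame; the column names are assumptions for illustration.
df <- data.frame(
  Subject = c(
    "[ghjg6] [fdg5] [113gi4lki] great work",
    "[xzswedc: acf] [xzt8] [111eerrh5]",
    "[asd2] [1] [113vu17hg 115er5lgr 112cgnmbh ] get out"
  ),
  stringsAsFactors = FALSE
)

# Inner gsub: drop everything before the last "[".
# Outer gsub: drop everything after the first remaining "]".
df$last_bracket <- gsub("\\].*", "]", gsub(".*\\[", "[", df$Subject))
df$last_bracket
```

Note that a row containing no brackets at all is returned unchanged by both calls, which may explain rows that come back with an unexpected string.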
jogo
  • Thanks, this works great. Can I apply this directly to a data frame of length 25, and how? – kishore May 23 '17 at 07:39
  • I got it working on the column as well, thanks. It just pulls the wrong string in one or two rows. – kishore May 23 '17 at 07:45

To get the text inside the last set of [...] brackets, you may use `sub` with the following pattern:

".*\\[([^][]+)].*"

The pattern matches:

  • .* - any 0+ chars greedily, as many as possible, up to the last occurrence of the subsequent subpatterns
  • \\[ - a literal [ (must be escaped outside of the bracket expression)
  • ([^][]+) - Group 1 (later referred to with \1) matching 1 or more chars other than ] and [
  • ] - a literal ] (no need to escape it outside of a bracket expression)
  • .* - the rest of the string.

R online demo:

x <- c("[ghjg6] [fdg5] [113gi4lki] great work", "[xzswedc: acf] [xzt8] [111eerrh5]", "[asd2] [1] [113vu17hg 115er5lgr 112cgnmbh ] get out", "Some text with no brackets")
df <- data.frame(x)
df$x = sub(".*\\[([^][]+)].*", "\\1", df$x)
df

Output:

                               x
1                      113gi4lki
2                      111eerrh5
3 113vu17hg 115er5lgr 112cgnmbh 
4     Some text with no brackets

If you want entries with no [...] (like the last one in my test set) to come back as empty strings instead of unchanged, use

df$x = sub(".*\\[([^][]+)].*|.*", "\\1", df$x)

See another online R demo.
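
As a small follow-up sketch (not part of the original demo): with the `|.*` alternative, a string without brackets is matched by the second branch, Group 1 stays empty, and the result is `""`. If a missing value is more convenient than an empty string downstream, the blanks can be turned into `NA`. The vector names `x` and `res` are just for illustration:

```r
x <- c("[asd2] [1] [113vu17hg] get out", "Some text with no brackets")

# First branch captures the last [...] contents; second branch matches
# anything else and leaves Group 1 empty.
res <- sub(".*\\[([^][]+)].*|.*", "\\1", x)

# Optional: mark rows without brackets as missing instead of "".
res[!nzchar(res)] <- NA_character_
res
```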

Wiktor Stribiżew
  • Stribizew, thank you for the above script. I came up with something else, `str_extract(df$Subject,"[:punct:]+([:digit:]+[:alpha:]+)+[:punct:]")-> df$x2`, which provides 90% of the correct data, while your line of code gives me the remaining 10%. If I use both, that is df$x2 as column 1 and the output from your script as column 2, I get some garbage data in column 2. I have two things to do now: 1. use both pieces of code and merge the results, which should discard the garbage data and give me 100% of the needed data; 2. try to find similar syntax for your code in my script format. – kishore Jun 13 '17 at 07:19
  • The `str_extract(df$Subject,"[:punct:]+([:digit:]+[:alpha:]+)+[:punct:]")` is wrong. You need to use POSIX character classes inside bracket expressions. Also, I guess by using `[[:digit:]]+[[:alpha:]]+` you want to match alphanumeric? Then use `[[:alnum:]]` and turn the `[[:punct:]]` part into lookaheads (or - if you just have `[...]`, use the `\\[` and `]`). Try `str_extract(df$Subject,"(?<=\\[)[[:alnum:]]+(?=])")` - but it won't return the *last* occurrences of [texts]. – Wiktor Stribiżew Jun 13 '17 at 07:26
  • @kishore: Please explain the rules again. Revise your question and edit accordingly as right now, it is not clear any longer what the requirements are. – Wiktor Stribiżew Jun 13 '17 at 07:28
  • @Wiktor Stribiżew, can we chat, please? I would be able to explain more clearly. – kishore Jun 13 '17 at 07:47
  • @kishore Try to add more messages here. – Wiktor Stribiżew Jun 13 '17 at 07:53
  • @kishore: Maybe `"(?<=\\[)[^]\\[]+(?=])(?!.*\\[[^]\\[]*])"` will work best for you with `str_extract`? – Wiktor Stribiżew Jun 13 '17 at 08:03
  • Or use `library(stringi)` and then `stri_extract_last(df$x,regex="(?<=\\[)[^]\\[]+(?=])") -> df$x3` – Wiktor Stribiżew Jun 13 '17 at 08:07
  • @Wiktor Stribiżew, OK, so I need to extract anything which starts with "11" and has a length of 9, e.g. 113gi4lki. Just now I tried `str_extract(df$Subject,"([:digit:]+[:alpha:]+[:alnum:]+)")-> df$x1` and got 98% accuracy, but I'm missing the cases where there are several such strings, like [113gi4lki 114gi4lki 115gi4lki]. This type of data is being pulled by your query, however some garbage values are also being pulled, like [ghjg6], [xzswedc: acf], [1] (this is just an example). I want only data of the kind 113gi4lki 114gi4lki 115gi4lki, which might also be separated by '/', like 113gi4lki / 114gi4lki /115gi4lki – kishore Jun 13 '17 at 08:08
  • Yes, I see. So the position in the string may be *any*? – Wiktor Stribiżew Jun 13 '17 at 08:11
  • Yes, the string can be anywhere in the subject and in any format, separated by a space, a comma or /, and it may not be inside [ ] (a little rare). It is sure to start with 113, 115, 116, or 11 to be more precise, and to have length 9. – kishore Jun 13 '17 at 08:16
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/146502/discussion-between-wiktor-stribizew-and-kishore). – Wiktor Stribiżew Jun 13 '17 at 08:24
  • You need [`(?<=\[)[^\w\]\[]*11[[:alnum:]]{7}(?:[^\w\]\[]+11[[:alnum:]]{7})*[^\w\]\[]*(?=])`](https://regex101.com/r/YNnzBN/1). If `[` and `]` can be missing, just remove the lookarounds: [`\b11[[:alnum:]]{7}(?:[^\w\]\[]+11[[:alnum:]]{7})*\b`](https://regex101.com/r/YNnzBN/3). Double the backslashes in R. – Wiktor Stribiżew Jun 13 '17 at 08:30
  • @kishore: (Note that SO inserts rubbish into the comment text, and the first pattern above won't work if you copy paste it from the comment above, copy from the regex101.com!) – Wiktor Stribiżew Jun 13 '17 at 08:38
  • @kishore: Please let me know if it works, and I will update the answer then. – Wiktor Stribiżew Jun 13 '17 at 08:47
  • `stri_extract_last(df$Subject,regex="(?<=\[)[^\w\]\[]*11[[:alnum:]]{7}(?:[^\w\]\[]+11[[:alnum:]]{7})*[^\w\]\[]*(?=])") -> df$x3` gives the error `Error: '\[' is an unrecognized escape in character string starting ""(?<=\["`, and `stri_extract_last(df$Subject,regex="\b11[[:alnum:]]{7}(?:[^\w\]\[]+11[[:alnum:]]{7})*\b") -> df$x3` gives the error `Error: '\w' is an unrecognized escape in character string starting ""\b11[[:alnum:]]{7}(?:[^\w"` – kishore Jun 13 '17 at 09:10
  • @kishore: I told to **double the backslashes in R**. – Wiktor Stribiżew Jun 13 '17 at 09:11
  • `regex="\\b11[[:alnum:]]{7}(?:[^\\w\\]\\[]+11[[:alnum:]]{7})*\\b"` - to define a backslash in an R string literal, you need to use 2 backslashes. – Wiktor Stribiżew Jun 13 '17 at 09:14
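
To make the last few comments concrete, here is a minimal sketch of the bracket-free pattern with doubled backslashes. It assumes base R with `perl = TRUE` instead of the `stringi` call used in the comments, and it returns the first such run per string; the vector names `x`, `pattern` and `codes` are hypothetical:

```r
x <- c(
  "[ghjg6] [fdg5] [113gi4lki] great work",
  "[asd2] [1] [113vu17hg 115er5lgr 112cgnmbh ] get out",
  "no code here"
)

# "11" plus 7 alphanumerics, optionally repeated after non-word,
# non-bracket separators; every regex backslash is doubled in the literal.
pattern <- "\\b11[[:alnum:]]{7}(?:[^\\w\\]\\[]+11[[:alnum:]]{7})*\\b"

m <- regexpr(pattern, x, perl = TRUE)      # -1 where nothing matches
codes <- rep(NA_character_, length(x))     # keep one result per input row
codes[m != -1] <- regmatches(x, m)
codes
```

Filling a pre-sized vector keeps the result aligned with the input rows, so it can be assigned straight to a data frame column; `regmatches()` on its own silently drops non-matching rows.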