
Possible Duplicate:
Extract info inside all parenthesis in R (regex)

I imported data from Excel, and one column consists of long strings that contain numbers and letters. Is there a way to extract only the numbers from such a string and store them in a new variable? Unfortunately, some of the entries have two sets of brackets, and I only want the second one. Could I use grep for that?

The strings look more or less like this, although their lengths vary:

"East Kootenay C (5901035) RDA 01011"

or like this:

"Thompson-Nicola J (Copper Desert Country) (5933039) RDA 02020"

All I want from these is 5901035 and 5933039.

Any hints and help would be greatly appreciated.

  • Is it possible for there to be two instances of numbers-in-brackets in the same line? For example, `"East Kootenay C (5901035) (5933039) RDA 01011"` – Blue Magister Oct 04 '12 at 20:52

2 Answers


There are many possible regular expressions to do this. Here is one:

x=c("East Kootenay C (5901035) RDA 01011","Thompson-Nicola J (Copper Desert Country) (5933039) RDA 02020")

> gsub('.+\\(([0-9]+)\\).+?$', '\\1', x)
[1] "5901035" "5933039"

Let's break down the syntax of that first expression '.+\\(([0-9]+)\\).+?$'

  • .+ one or more of anything
  • \\( parentheses are special characters in a regular expression, so if I want to represent the actual thing ( I need to escape it with a \. I have to escape it again for R (hence the two \s).

  • ([0-9]+) I mentioned special characters; here I use two. The first is the parentheses, which indicate a group I want to keep. The second is [ and ], which surround a set of characters to match (here the digits 0 through 9); the + after them means one or more of those. See ?regex for more information.

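  • .+?$ The final piece matches everything after the closing paren up to the end of the string. Together with the greedy .+ at the start, this assures that I am grabbing the LAST set of numbers in parens as noted in the comments.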
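The double escaping and the capture group described in the list above can be checked directly. This is only a small sketch: the variable names (pattern, n_chars, s, parts) are made up for illustration, and regexec() is used here just to inspect the group, not because the answer relies on it.

```r
# The pattern from the answer, typed as an R string (backslashes doubled).
pattern <- ".+\\(([0-9]+)\\).+?$"

# writeLines() shows the text the regex engine actually receives:
# each doubled backslash has collapsed to a single one.
writeLines(pattern)

# "\\(" in R source is two characters: one backslash and one open paren.
n_chars <- nchar("\\(")

# regexec() + regmatches() return the whole match followed by each group.
s <- "East Kootenay C (5901035) RDA 01011"
parts <- regmatches(s, regexec(pattern, s))[[1]]
parts[1]  # the whole match (here the entire string)
parts[2]  # group 1: the digits inside the parentheses
```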

I could also use * instead of +, which would mean zero or more rather than one or more, in case your paren string comes at the very beginning or end of a string.

The second piece of the gsub is what I am replacing the first portion with. I used: \\1. This says use group 1 (the stuff inside the ( ) from above). The backslash is doubled again here: the regex replacement wants \1, and R needs the backslash itself escaped.
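As a rough sketch of the zero-or-more variant, here is the same idea with .* on both sides, run on the two strings from the question, the two-number case raised in the comments, and one hypothetical string with nothing around the parens. The names ids and id_num are made up for illustration, and converting to numeric is an assumption about what "store it in a new variable" should mean.

```r
x <- c("East Kootenay C (5901035) RDA 01011",
       "Thompson-Nicola J (Copper Desert Country) (5933039) RDA 02020",
       "East Kootenay C (5901035) (5933039) RDA 01011",  # case from the comment
       "(5901035)")                                       # hypothetical edge case

# '.*' allows nothing at all before or after the bracketed number.
# The leading '.*' is greedy, so group 1 lands on the last "(digits)".
ids <- sub(".*\\(([0-9]+)\\).*", "\\1", x)
ids

# Optional: store the result as numbers rather than character strings.
id_num <- as.numeric(ids)
```

Note that a string with no bracketed number at all is returned unchanged by sub(), so it may be worth checking for that before converting.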

Clear as mud to be sure! Enjoy your data munging project!

Justin

Here is a gsubfn solution:

library(gsubfn)

strapplyc(x, "[(](\\d+)[)]", simplify = TRUE)

[(] matches an open paren, (\\d+) matches a string of digits creating a back-reference owing to the parens around it and finally [)] matches a close paren. The back-reference is returned.
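For comparison, something similar can be sketched in base R without gsubfn. This is a swapped-in alternative using regmatches() and gregexpr(), not what strapplyc does internally; the names m and ids are made up, and keeping only the last match per string is an assumption based on the question.

```r
x <- c("East Kootenay C (5901035) RDA 01011",
       "Thompson-Nicola J (Copper Desert Country) (5933039) RDA 02020")

# Find every "(digits)" run in each string; the result is a list with
# one character vector of matches per element of x.
m <- regmatches(x, gregexpr("[(][0-9]+[)]", x))

# Keep the last match per string and strip the surrounding parentheses.
# Strings with no match give NA.
ids <- vapply(m, function(hits) {
  if (length(hits) == 0) return(NA_character_)
  gsub("[()]", "", hits[length(hits)])
}, character(1))
ids
```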

G. Grothendieck