1

I am fairly new to Python, so I apologize if this is a novice question. I am trying to extract text in parentheses that has a specific format from a raw text file. I have tried this with regular expressions, but please let me know if there is a better method.

To show what I want to do by example:

s = "Testing (Stackoverflow, 2013). Testing (again) (Stackoverflow, 1999)"

From this string I want a result something like:

['(Stackoverflow, 2013)', '(Stackoverflow, 1999)']

The regular expression I have tried so far is

"(\(.+[,] [0-9]{4}\))"

in conjunction with re.findall(). However, this only gives me the result:

['(Stackoverflow, 2013). Testing (again) (Stackoverflow, 1999)']

So, as you may have guessed, I am trying to extract the bibliographic references from a .txt file. But I don't want to extract anything that happens to be in parentheses that is not a bibliographic reference.

Again, I apologize if this is a novice question, or if a question like this already exists. I have searched, but with no luck so far.

SamPassmore

3 Answers

1

Use [^()] instead of the dot (.). This ensures that a match cannot contain any other parentheses, so it cannot run from one reference into the next.

>>> re.findall(r"(\([^()]+[,] [0-9]{4}\))", s)
['(Stackoverflow, 2013)', '(Stackoverflow, 1999)']
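
To illustrate why the change helps, here is a minimal sketch that runs both patterns on the string from the question. The variable names `greedy` and `bounded` are only for illustration, and the redundant outer group is dropped so that `re.findall()` returns the whole match.

```python
import re

s = "Testing (Stackoverflow, 2013). Testing (again) (Stackoverflow, 1999)"

# Greedy ".+" may cross parentheses, so the match runs from the
# first "(" all the way to the last ", YYYY)" in the string.
greedy = re.findall(r"\(.+, [0-9]{4}\)", s)

# "[^()]+" matches one or more characters that are NOT "(" or ")",
# so each match has to stay inside a single pair of parentheses.
bounded = re.findall(r"\([^()]+, [0-9]{4}\)", s)

print(greedy)   # one long match
print(bounded)  # one match per reference
```

The raw-string prefix (`r"..."`) keeps Python from treating `\(` as a string escape, which newer Python versions warn about.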
zhangyangyu
  • Thanks! That works excellently, and in my full text file as well. Do you mind explaining how this "[^()]+" works? – SamPassmore Aug 08 '13 at 04:59
0

Assuming that you will have no nested parentheses, you could use something like this: (\([^()]+?, [0-9]{4}\)). This matches an opening parenthesis, then one or more characters that are not parentheses, followed by a comma, a single space, four digits, and a closing parenthesis.
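
As a rough sketch, the same pattern can be written with `re.VERBOSE` so that each piece carries its own comment. The name `REFERENCE` is made up for this example, and the outer capturing group is left out because `findall()` already returns the full match when the pattern has no groups.

```python
import re

# Hypothetical name; the pattern mirrors the one described above.
REFERENCE = re.compile(
    r"""
    \(          # opening parenthesis
    [^()]+?     # one or more non-parenthesis characters, as few as possible
    ,[ ]        # a comma and one space (bracketed: VERBOSE ignores bare spaces)
    [0-9]{4}    # four digits
    \)          # closing parenthesis
    """,
    re.VERBOSE,
)

s = "Testing (Stackoverflow, 2013). Testing (again) (Stackoverflow, 1999)"
matches = REFERENCE.findall(s)
print(matches)
```

On this input, "(again)" is skipped because it has no comma followed by a four-digit year.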

npinti
  • Ah, excellent. Thank you for the explanation! And yes, there will be no nested parentheses. However, how would nesting change your response? – SamPassmore Aug 08 '13 at 05:03
  • @SamPassmore: Glad it worked out for you. It would change because of this: `[^()]`. This matches any single character except a parenthesis, so the pattern cannot match a reference that itself contains parentheses. Changes would need to be made to cater for nesting. – npinti Aug 08 '13 at 05:09
0

I would suggest something like \(\w+,\s+[0-9]{4}\). A couple of changes from your original:

  • Match word characters (letters/numbers/underscores) instead of any character in the source name.
  • Match one or more space characters after the comma, instead of limiting yourself to a single literal space.
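
A quick sketch of how this pattern behaves. The second test string is invented for illustration and is not from the question. It shows one trade-off: \w+ does not match spaces, so a source name made of several words is skipped.

```python
import re

pattern = re.compile(r"\(\w+,\s+[0-9]{4}\)")

s = "Testing (Stackoverflow, 2013). Testing (again) (Stackoverflow, 1999)"
from_question = pattern.findall(s)

# Hypothetical input: a multi-word source name and a double space after the comma.
other = "See (Smith and Jones, 2010) and (Doe,  2001)"
from_other = pattern.findall(other)

print(from_question)
print(from_other)  # only the single-word source is matched
```

If the references in your file can contain spaces in the source name, the [^()]+ approach from the other answers may be the safer choice.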
ajk