Extracting parentheses with a specific format in Python

Question

I am fairly new to Python, so I apologize if this is a novice question. I am trying to extract text in parentheses that has a specific format from a raw text file. I have tried this with regular expressions, but please let me know if there is a better method.

To show what I want to do by example:

s = "Testing (Stackoverflow, 2013). Testing (again) (Stackoverflow, 1999)"

From this string I want a result something like:

['(Stackoverflow, 2013)', '(Stackoverflow, 1999)']

The regular expression I have tried so far is

"(\(.+[,] [0-9]{4}\))"

in conjunction with re.findall(). However, this only gives me the result:

['(Stackoverflow, 2013). Testing (again) (Stackoverflow, 1999)']

So, as you may have guessed, I am trying to extract the bibliographic references from a .txt file, but I don't want to extract anything in parentheses that is not a bibliographic reference.

Again, I apologize if this is a novice question or if a question like this has already been asked. I have searched, but had no luck so far.

score 1 · Answer 1 · answered Aug 08 '13 at 04:56


Use [^()] instead of the dot (.). This ensures that the match cannot contain any other parentheses.

>>> re.findall(r"(\([^()]+[,] [0-9]{4}\))", s)
['(Stackoverflow, 2013)', '(Stackoverflow, 1999)']
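
As a rough illustration of why this works, the sketch below runs the greedy pattern from the question next to the [^()] version on the same string. The variable names greedy and bounded are only for illustration, and the outer capture group is left out because re.findall() returns the whole match when a pattern has no groups.

```python
import re

s = "Testing (Stackoverflow, 2013). Testing (again) (Stackoverflow, 1999)"

# Greedy ".+" is allowed to cross parentheses, so a single match runs from
# the first "(" to the last ", dddd)" in the string.
greedy = re.findall(r"\(.+, [0-9]{4}\)", s)

# "[^()]+" matches one or more characters that are neither "(" nor ")",
# so a match can never extend past the parentheses it started in.
bounded = re.findall(r"\([^()]+, [0-9]{4}\)", s)

print(greedy)   # one long match
print(bounded)  # one match per reference
```

The [^...] form is a negated character class: it accepts any single character except the ones listed, and the + repeats it one or more times.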


zhangyangyu


Thanks! That works excellently, and in my full text file as well. Do you mind explaining how this "[^()]+" works? – SamPassmore Aug 08 '13 at 04:59

score 0 · Accepted Answer · answered Aug 08 '13 at 04:57


Assuming that you will have no nested parentheses, you could use something like this: (\([^()]+?, [0-9]{4}\)). It matches an opening parenthesis, then one or more characters that are not parentheses, followed by a comma, a single space, four digits, and a closing parenthesis.
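
The same pattern can be spelled out piece by piece with re.VERBOSE, which may make it easier to follow. This is only a sketch: the name CITATION is made up for the example, the outer capture group is dropped (re.findall() then returns the whole match), and the literal space is written as [ ] because verbose mode ignores plain whitespace in the pattern.

```python
import re

# Verbose form of the pattern above. Whitespace inside the pattern is
# ignored in this mode, so the literal space is written as "[ ]".
CITATION = re.compile(r"""
    \(          # opening parenthesis
    [^()]+?     # one or more characters that are not parentheses (lazy)
    ,[ ]        # a comma followed by one space
    [0-9]{4}    # four digits (the year)
    \)          # closing parenthesis
""", re.VERBOSE)

s = "Testing (Stackoverflow, 2013). Testing (again) (Stackoverflow, 1999)"
citations = CITATION.findall(s)
print(citations)
```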


npinti


Ah, excellent. Thank you for the explanation! And yes, there will be no nested parentheses. However, how would nesting change your response? – SamPassmore Aug 08 '13 at 05:03
@SamPassmore: Glad it worked out for you. It would change because of this: `[^()]`. This tells the engine not to match any parentheses inside the outer pair, so the pattern would need to change to cater for nesting. – npinti Aug 08 '13 at 05:09

score 0 · Answer 3 · answered Aug 08 '13 at 04:59

I would suggest something like \(\w+,\s+[0-9]{4}\). A couple of changes from your original:

Match word characters (letters/numbers/underscores) instead of any character in the source name.
Match one or more space characters after the comma, instead of limiting yourself to a single literal space.
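
A quick sketch of how this stricter pattern behaves. The samples string is made up for illustration and is not from the question. Note that \w+ does not match spaces, so a source name with several words is skipped.

```python
import re

pattern = re.compile(r"\(\w+,\s+[0-9]{4}\)")

# Hypothetical sample text: extra spaces after the comma, a plain
# parenthetical, and a multi-word source name.
samples = "(Stackoverflow, 2013) (Stackoverflow,   1999) (again) (Smith and Jones, 2001)"

found = pattern.findall(samples)
print(found)
```

Here the first two references are found, while "(again)" and "(Smith and Jones, 2001)" are not. If multi-word names can occur in the file, the [^()]+ approach from the other answers is probably the safer choice.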
