I am trying to use RegEx in Python to split a string into two groups. The string starts with anything and may or may not end with a year in parentheses. The first group should contain everything but the year, and the second should contain only the year, or nothing if there is no year.

This is what I have so far:

import re

string1 = 'First string'
string2 = 'Second string (2013)'

p = re.compile(r'(.*)\s*(?:\((\d{4,4})\))?')

print(p.match(string1).groups())
print(p.match(string2).groups())

which returns this:

('First string', None)
('Second string (2013)', None)

But I'm trying to get this:

('First string', None)
('Second string', '2013')

I realize that the first part in my RegEx is greedy, but I can't find a way to make it not greedy without matching nothing. Also, the first part of my string can contain more or less anything (including parentheses and numbers).

I realize there are ways I can work around this, but since I'm trying to learn RegEx I'd prefer a RegEx solution.

  • [*Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.*](http://regex.info/blog/2006-09-15/247) - [Jamie Zawinski](https://en.wikipedia.org/wiki/Jamie_Zawinski) –  Nov 11 '15 at 18:38
  • Possible duplicate of [How do I optionally match an additional substring with Python regular expressions?](http://stackoverflow.com/questions/17936594/how-do-i-optionally-match-an-additional-substring-with-python-regular-expression) –  Nov 11 '15 at 18:40
  • I think your learning is misguided. You should learn how to use regex for things that are made easier with regex. This is not made easier with regex. Almost nothing is made easier with regex in Python. – ArtOfWarfare Nov 11 '15 at 18:53
  • To clarify - I'm working on a real project, but I'm also trying to learn RegEx while doing it. I figured that the other ways I could come up with for solving this would require many more lines of code. I also thought it might be possible to do it in one or two lines using RegEx. I agree that it's probably not the easiest way, but I thought I'd learn something by figuring out if it can be done or not. – standard_error Nov 11 '15 at 19:01
  • The number of lines of code involved is a poor way of measuring quality of the code. The regex may or may not be more compact (my regex solution was 6 lines vs my non-regex solution of 3 lines), but the regex will almost always be harder to read/follow. Consider this: You're learning regex right now. How long is that taking? How long did it take you to learn the Python slicing in the non-regex solution? I assume the answer is that you learned the slicing much quicker, and from that I would conclude the slicing is much better from a maintenance and quality perspective. – ArtOfWarfare Nov 11 '15 at 19:41
  • I've learned and forgotten regex dozens of times. It's so rare that it's useful that you'll learn it to solve this one problem (where it's not necessary) and you'll forget it before the next time you have a problem where it might be useful. Regex is really obtuse and anything but intuitive. – ArtOfWarfare Nov 11 '15 at 19:42
  • Yes, I think you're right. I'm still curious to know if it's at all possible to solve my problem using RegEx, but for now I'll use a different solution. – standard_error Nov 11 '15 at 19:46

2 Answers

Here's a simple method that does what you want:

def extractYear(s):
    # A trailing year looks like "(2013)": six characters, four of them digits.
    if len(s) >= 6 and s[-6] == '(' and s[-5:-1].isdigit() and s[-1] == ')':
        # Drop the space before the parenthesis and return only the digits.
        return s[:-6].rstrip(), s[-5:-1]
    return s, None

No regex needed. Just check whether it ends with a four-digit number wrapped in parentheses. If it does, return the two substrings with the proper split. If it doesn't, return the entire string and None.

Alternatively, if you insist on using regex, you could do something more like:

import re

def extractYear(s):
    if len(s) >= 6:
        year = s[-6:]
        p = re.compile(r'\(\d{4}\)')
        if p.match(year):
            # Strip the trailing space and the parentheses from the results.
            return s[:-6].rstrip(), year[1:-1]
    return s, None

The pattern checks for a year wrapped in parentheses. It doesn't care about anything else: we only give it the last six characters to see whether they match.
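A variation on the same idea, as a rough sketch rather than a tested solution: let the regex find the trailing year itself by anchoring the pattern to the end of the string with `$` and using `search`, then slice off everything before the match. The names `YEAR_AT_END` and `extract_year_search` are made up for this example.

```python
import re

# Optional whitespace, then a four-digit year in parentheses, at the very end.
YEAR_AT_END = re.compile(r'\s*\((\d{4})\)$')

def extract_year_search(s):
    m = YEAR_AT_END.search(s)
    if m:
        # Everything before the match is the title; group 1 is the digits only.
        return s[:m.start()], m.group(1)
    return s, None

print(extract_year_search('First string'))
print(extract_year_search('Second string (2013)'))
```

Because only the end of the string is inspected, parentheses and numbers earlier in the string should not interfere.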

ArtOfWarfare

Try this: (.*)\s*(?:\((\d{4,4})\))

>>> import re
>>> string2 = "Second String (2013)"
>>> p = re.compile(r"(.*)\s*(?:\((\d{4,4})\))")
>>> p.match(string2).groups()
('Second String ', '2013')
Raunak Agarwal
  • That fails on the first string, giving me the error `AttributeError: 'NoneType' object has no attribute 'groups'` – standard_error Nov 11 '15 at 18:45
  • Of course, that's expected; it just means there is no year in your string. To avoid the exception, don't call `.groups()` on the `None` returned from `p.match`. – Raunak Agarwal Nov 11 '15 at 19:03
  • Yes, but my question was precisely if it was possible to write a regular expression that would give me the output I want for both cases. – standard_error Nov 11 '15 at 19:09
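For what it's worth, a single pattern does seem able to handle both cases. The sketch below makes the first group lazy (`.*?`) and forces the match to cover the whole string with `fullmatch`, so the lazy group cannot stop early and match nothing. It assumes Python 3.4 or later (for `fullmatch`) and input without newlines, since `.` does not match them by default; `split_year` is a made-up name.

```python
import re

# Lazy first group, optional whitespace, then an optional "(dddd)" at the end.
p = re.compile(r'(.*?)\s*(?:\((\d{4})\))?')

def split_year(s):
    # fullmatch requires the pattern to consume the entire string, which
    # pushes the lazy group to grow until the optional year can match.
    return p.fullmatch(s).groups()

print(split_year('First string'))          # ('First string', None)
print(split_year('Second string (2013)'))  # ('Second string', '2013')
```

With plain `match`, adding `$` at the end of the pattern should have much the same effect. The first part can still contain parentheses and numbers, because only a four-digit group in parentheses at the very end is captured as the year.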