Regular expression match last occurence of year in string

Question

I have written a python script with the following function, which takes as input a file name that contains multiple dates.

CODE

import re
from datetime import datetime

def ExtractReleaseYear(title):
    rg = re.compile('.*?([\[\(]?((?:19[0-9]|20[01])[0-9])[\]\)]?)', re.IGNORECASE|re.DOTALL)
    match = rg.search(title) # Using non-greedy match on filler
    if match:
        releaseYear = match.group(1)
        try:
            if int(releaseYear) >= 1900 and int(releaseYear) <= int(datetime.now().year) and int(releaseYear) <= 2099: # Film between 1900-2099
                return releaseYear
        except ValueError:
            print("ERROR: The film year in the file name could not be converted to an integer for comparison.")
            return ""

print(ExtractReleaseYear('2012.(2009).3D.1080p.BRRip.SBS.x264'))
print(ExtractReleaseYear('Into.The.Storm.2012.1080p.WEB-DL.AAC2.0.H264'))
print(ExtractReleaseYear('2001.A.Space.Odyssey.1968.1080p.WEB-DL.AAC2.0.H264'))

OUTPUT

Returned: 2012 -- I'd like this to be 2009 (i.e. last occurrence of year in string)

Returned: 2012 -- This is correct! (last occurrence of year is the first one, thus right)

Returned: 2001 -- I'd like this to be 1968 (i.e. last occurrence of year in string)

ISSUE

As can be observed, the regex will only target the first occurrence of a year instead of the last. This is problematic because some titles (such as the ones included here) begin with a year.

Having searched for ways to get the last occurrence of the year has led me to this resources like negative lookahead, last occurrence of repeated group and last 4 digits in URL, none of which have gotten me any closer to achieving the desired result. No existing question currently answers this unique case.

INTENDED OUTCOME

I would like to extract the LAST occurrence (instead of the first) of a year from the given file name and return it using the existing definition/function as stated in the output quote above. While I have used online regex references, I am new to regex and would appreciate someone showing me how to implement this filter to work on the file names above. Cheers guys.

Since you are always checking for group(1) it's giving you first match. Check for group length, if it's more than one, take last group's match value. — Supriya, Jan 04 '18 at 11:15
There are no real solution to do that, in particular because filenames can have many forms, for instance, there's few chances that all of your filenames have always a "p" after "1080". — Casimir et Hippolyte, Jan 04 '18 at 11:20
@ProGrammer: what if the movie title contains a year and if the release year is missing in the filename? Also instead 1900, you should choose 1895 (release year of *sortie d'usine* by Louis and Auguste Lumière). — Casimir et Hippolyte, Jan 04 '18 at 14:54
@CasimiretHippolyte I have extended the range of the lower boundary to `1888`. Suppose the title `2018.3D.1080p.BRRip.SBS.x264.AAC` is run through the function, it returns `2018` (notice movie title contains a year and the release year is missing in the filename). In this case, the last occurrence is also the first. It does become a concern if the last occurrence is some sort of upload/record date e.g. `2018.3D.1080p.BRRip.SBS.x264.AAC.01-01-2018` and I'd love to see a smart way to filter out such instances. For now, I am confident that this is a rare format in film titles. — ProGrammer, Jan 04 '18 at 22:12
The problem is that you can have different kinds of "exceptions": New.York.1997.mvk (<=1981), Los Angeles 2013.avi (<=1996), 1984.avi (<=2003), Les valseuses.1974.1920[a non word character]1080.mpg, a movie with a second version (remasterised or with additionnal scenes) with the original release year and the second version release year. — Casimir et Hippolyte, Jan 04 '18 at 23:19
I don't say that regexes are useless for this task, but if I were you, I will build a pattern to separate first, easy cases and hard cases. Then you can build categories for hard cases, but you will probably ends to handle hard cases yourself by hands. (no clear format===no rules=> no way to automate a task *without an external knowledge or a database*) — Casimir et Hippolyte, Jan 04 '18 at 23:23

score 2 · Answer 1 · answered Jan 09 '18 at 09:21

as per @kenyanke answer choosing findall() over search() will be a better option as former returns all non-overlapping matching pattern. You can choose last matching pattern as releaseYear. here is my regex to find releaseYear

rg = re.compile(r'[^a-z](\d{4})[^a-z]', re.IGNORECASE)
match = rg.findall(title)
if match:
        releaseYear = match[-1]

Above regex expression is made with an assumption that immediate letter before or after releaseYear is non-alphabet character. Result(match) for three string are

['2009']
['2012']
['1968']

score 1 · Accepted Answer · answered Jan 04 '18 at 11:15

There are two things you need to change:

The first .*? lazy pattern must be turned to greedy .* (in this case, the subpatterns after .* will match the last occurrence in the string)
The group you need to use is Group 2, not Group 1 (as it is the group that stores the year data). Or make the first capturing group non-capturing.

See this demo:

rg = re.compile('.*([\[\(]?((?:19[0-9]|20[01])[0-9])[\]\)]?)', re.IGNORECASE|re.DOTALL)
...
releaseYear = match.group(2)

Or:

rg = re.compile('.*(?:[\[\(]?((?:19[0-9]|20[01])[0-9])[\]\)]?)', re.IGNORECASE|re.DOTALL)
...
releaseYear = match.group(1)

score 1 · Answer 3 · answered Jan 04 '18 at 13:10

Consider using findall() over search()?

It will put all values found into a list from left-to-right, just gotta access the right most value to get what you want.

import re
from datetime import datetime

def ExtractReleaseYear(title):
    rg = re.compile('.*?([\[\(]?((?:19[0-9]|20[01])[0-9])[\]\)]?)', re.IGNORECASE|re.DOTALL)
    match = rg.findall(title)

    if match:
        try:
            releaseYear = match[-1][-1]
            if int(releaseYear) >= 1900 and int(releaseYear) <= int(datetime.now().year) and int(releaseYear) <= 2099: # Film between 1900-2099
                return releaseYear
        except ValueError:
            print("ERROR: The film year in the file name could not be converted to an integer for comparison.")
            return ""

print(ExtractReleaseYear('2012.(2009).3D.1080p.BRRip.SBS.x264'))
print(ExtractReleaseYear('Into.The.Storm.2012.1080p.WEB-DL.AAC2.0.H264'))
print(ExtractReleaseYear('2001.A.Space.Odyssey.1968.1080p.WEB-DL.AAC2.0.H264'))

Regular expression match last occurence of year in string

3 Answers3