Python Regular Expression - right-to-left

Question

I am trying to use regular expressions in Python to match the frame number component of an image file name in a sequence of images. I want a solution that covers a number of different naming conventions. Put into words, I am trying to match the last run of one or more digits between two dots (e.g. .0100.). Below is an example of how my current logic falls down:

import os
import re    

def sub_frame_number_for_frame_token(path, token='@'):
    folder = os.path.dirname(path)
    name = os.path.basename(path)
    pattern = r'\.(\d+)\.'
    matches = list(re.finditer(pattern, name) or [])
    if not matches:
        return path

    # Get last match.
    match = matches[-1]
    frame_token = token * len(match.group(1))
    start, end = match.span()
    apetail_name = '%s.%s.%s' % (name[:start], frame_token, name[end:])
    return os.path.join(folder, apetail_name)

# Success
eg1 = 'xx01_010_animation.0100.exr'
eg1 = sub_frame_number_for_frame_token(eg1) # result: xx01_010_animation.@@@@.exr

# Failure
eg2 = 'xx01_010_animation.123.0100.exr'
eg2 = sub_frame_number_for_frame_token(eg2) # result: xx01_010_animation.@@@.0100.exr

I realise there are other ways to solve this issue (I have already implemented a solution that splits the path at the dots and takes the last item that is a number), but I am taking this opportunity to learn something about regular expressions. It appears the regular expression builds its matches from left to right and cannot use a character in the string for more than one match. Firstly, is there any way to search the string from right to left? Secondly, why doesn't the pattern find two matches in eg2 (123 and 0100)?

Cheers

@hughdbrown That is not his issue...look up `finditer`. Anyways @Yani you should look up `finditer` too. :) It returns an iterator over all *non-overlapping* matches. His second string is `'xx01_010_animation.123.0100.exr'`. The list called `matches` only returns 1 element because the matches overlap. — Shashank, Sep 12 '13 at 01:55
I'm not sure I'm understanding the code you give, but my first thought for searching a string right-to-left would be to reverse the string (`example[::-1]`). Since your pattern (`\.(\d+)\.`) is symmetrical it seems like it would work. Has that option already been ruled out? — Ali Alkhatib, Sep 12 '13 at 02:27

Carl Groner · Answer 1 · 2013-09-12T02:00:55.717

finditer will return an iterator "over all non-overlapping matches in the string".

In your example, the last . of the first match will "consume" the first . of the second. Basically, after making the first match, the remaining string of your eg2 example is 0100.exr, which doesn't match.
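As a rough illustration of that "consuming" behaviour, here is a minimal sketch that redoes the scan by hand. Slicing the string is only an approximation of how finditer resumes after a match, and the variable names are made up for the example:

```python
import re

name = 'xx01_010_animation.123.0100.exr'
pattern = re.compile(r'\.(\d+)\.')

first = pattern.search(name)
# Both dots belong to the first match, so the scan resumes after the second dot.
remaining = name[first.end():]

print(first.group(0))             # .123.
print(remaining)                  # 0100.exr
print(pattern.search(remaining))  # None: no leading dot is left for 0100
```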

To avoid this, you can use a lookahead assertion (?=...), which checks that the trailing dot is there without consuming it, so it is still available as the start of the next match:

>>> pattern = re.compile(r'\.(\d+)(?=\.)')

>>> pattern.findall(eg1)
['0100']

>>> pattern.findall(eg2)
['123', '0100']

>>> eg3 = 'xx01_010_animation.123.0100.500.9000.1234.exr'
>>> pattern.findall(eg3)
['123', '0100', '500', '9000', '1234']
# and "right to left"
>>> pattern.findall(eg3)[::-1]
['1234', '9000', '500', '0100', '123']
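
Plugged back into the function from the question, the lookahead pattern might be used like this. This is a sketch only; replace_last_frame and FRAME_PATTERN are hypothetical names, not part of the original code:

```python
import os
import re

# The lookahead leaves the trailing dot available for the next match.
FRAME_PATTERN = re.compile(r'\.(\d+)(?=\.)')

def replace_last_frame(path, token='@'):
    """Replace the last .<digits>. group in the file name with token characters."""
    folder, name = os.path.split(path)
    matches = list(FRAME_PATTERN.finditer(name))
    if not matches:
        return path
    # Span of the digits only (group 1), taken from the last match.
    start, end = matches[-1].span(1)
    new_name = name[:start] + token * (end - start) + name[end:]
    return os.path.join(folder, new_name)

print(replace_last_frame('xx01_010_animation.0100.exr'))
print(replace_last_frame('xx01_010_animation.123.0100.exr'))
```

Because only the digits are replaced, the surrounding dots never need to be rebuilt with string formatting.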

Great. This is what I want. I guess I should have read the docs a little more carefully but appreciate the practical example. Thanks! — Yani, Sep 17 '13 at 21:20

score 1 · Answer 2 · edited May 23 '17 at 10:28

My solution is a very simple, hackish fix: reverse the string before searching and reverse the result again before returning it, so the regular expression effectively searches the backwards version of your string. Hackish, but it works. I used the syntax shown in this question to reverse the string.

import os
import re    

def sub_frame_number_for_frame_token(path, token='@'):
    folder = os.path.dirname(path)
    # Reverse only the file name; reversing the whole path would swap
    # the folder and file name parts.
    name = os.path.basename(path)[::-1]
    pattern = r'\.(\d+)\.'
    # The first match in the reversed name is the last one in the original.
    match = re.search(pattern, name)
    if not match:
        return path

    frame_token = token * len(match.group(1))
    start, end = match.span()
    apetail_name = '%s.%s.%s' % (name[:start], frame_token, name[end:])
    return os.path.join(folder, apetail_name[::-1])

# Success
eg1 = 'xx01_010_animation.0100.exr'
eg1 = sub_frame_number_for_frame_token(eg1) # result: xx01_010_animation.@@@@.exr

# Previously failing case, now handled
eg2 = 'xx01_010_animation.123.0100.exr'
eg2 = sub_frame_number_for_frame_token(eg2) # result: xx01_010_animation.123.@@@@.exr

print(eg1)
print(eg2)

score 0 · Answer 3 · answered Sep 12 '13 at 02:00

I believe the problem is that finditer returns only non-overlapping matches. Because both '.' characters are part of the regular expression, the first match consumes the second dot, so that dot can't serve as the start of another match. You can probably use the lookahead construct (?=...) to check for the second dot without consuming it, i.e. \.(\d+)(?=\.).

Because of the way regular expressions work, I don't think there is an easy way to search right-to-left (though I suppose you could reverse the string and write the pattern backwards...).

Edward · Answer 4 · 2013-09-13T19:47:36.230

If all you care about is the last \.(\d+)\., then anchor your pattern to the end of the string and do a simple re.search():
\.(\d+)\.[^.]*$
where [^.]* matches only non-dot characters, so no other dot-delimited group can sit between your real target and the end of the string. (A lazy (?:.*?) in that position does not work: re.search still scans from the left, so .*? simply stretches to the end of the string and the first group wins.)
(Caveat 1: This assumes a single extension with no further dots after the frame number. Caveat 2: That is one ugly regex, so add a comment explaining what it's doing.)
UPDATE: Actually, I guess you could just use ^.*\.(\d+)\. and let the greedy .* match as much as possible (including matches that occur earlier in the string) while still matching your group. That makes for a simpler regex, but I think it makes your intentions less clear.
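
The greedy-prefix idea can be sketched as a single substitution. This is an illustrative variant, not tested against every naming convention; LAST_FRAME and replace_last_frame_greedy are hypothetical names, and a lookahead stands in for the trailing dot so that it stays in the output:

```python
import re

# Greedy (.*\.) grabs as much as it can, then backtracks until the rest
# of the pattern fits, so group 2 ends up being the last .<digits>. run.
LAST_FRAME = re.compile(r'^(.*\.)(\d+)(?=\.)')

def replace_last_frame_greedy(name, token='@'):
    """Swap the last dot-delimited digit run in name for token characters."""
    return LAST_FRAME.sub(
        lambda m: m.group(1) + token * len(m.group(2)),
        name,
        count=1,
    )

print(replace_last_frame_greedy('xx01_010_animation.123.0100.exr'))
```

If nothing matches, sub returns the name unchanged, so no separate "no match" branch is needed.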
