38

I want to match the last occurrence of a simple pattern in a string, e.g.

list = re.findall(r"\w+ AAAA \w+", "foo bar AAAA foo2 AAAA bar2")
print "last match: ", list[len(list)-1]

However, if the string is very long, a huge list of matches is generated. Is there a more direct way to match the second occurrence of " AAAA ", or should I use this workaround?

aschultz
  • 1,658
  • 3
  • 20
  • 30
SDD
  • 1,421
  • 2
  • 20
  • 24
  • 10
    Another option could be to reverse the string (`mystr[::-1]`) and search for the first occurence of the reverse of the pattern. – ChristopheD May 10 '10 at 11:39
  • 7
    @ChristopheD, Gross! Only thing harder to understand than a regex is a backwards one. – mlissner Jul 02 '14 at 17:45
  • None of the current answers address the "long string" problem. `>>> timeit.timeit(stmt = 'regex.search(long_string)', setup = "import re; regex=re.compile('b'); long_string='a'*int(10e8)+'b'; reverse_string=long_string[::-1]", number=10)` `8.432429309999861` – Him Jan 16 '19 at 06:02
  • 1
    `>>> timeit.timeit(stmt = 'regex.search(reverse_string)', setup = "import re; regex=re.compile('b'); long_string='a'*int(10e8)+'b'; reverse_string=long_string[::-1]", number=10)` `3.3803000405896455e-05` – Him Jan 16 '19 at 06:03
  • `>>> timeit.timeit(stmt = 'regex.search(long_string)', setup = "import re; regex=re.compile('b$'); long_string='a'*int(10e8)+'b'; reverse_string=long_string[::-1]", number=10)` `7.993536103000224` – Him Jan 16 '19 at 06:03
  • @Scott Lol, saw a bounty question, did a very complete answer. – U13-Forward Jan 16 '19 at 10:11
  • @Scott Isn't reversing the string fast enough with the time you got of `3.3803000405896455e-05`? – Jerry Jan 17 '19 at 05:51
  • The string reversal takes a long time (i.e. if `reverse_string=long_string[::-1]` were in the `timeit` `stmt` instead of `setup`) The point is to illustrate what *could be* if we could `re.search` from the end. – Him Jan 17 '19 at 13:06
  • @Jerry, in fact, reversing the string before searching is considerably slower (when you include the time of string reversal): `>>> timeit.timeit(stmt = 'reverse_string=long_string[::-1]; regex.search(reverse_string)', setup = "import re; regex=re.compile('b'); long_string='a'*int(10e8)+'b'", number=10) 37.642173506000006` – Him Jan 17 '19 at 14:45
  • @Scott All right, then my next comment will be that those tests are using a string where the worst case scenario is against the traditional search and in favour of a currently non-existing search from the back. The situation would be reversed if the string was `'b'*int(10e8)+a` where it would be a best case scenario for the traditional search and a worst case scenario for the search from the back. I believe the current solutions might already be as best as we can get and the only thing that can help speed things up would be on the hardware. – Jerry Jan 17 '19 at 19:56
  • @Jerry, I'm not suggesting that always doing a search from the back is somehow better in all circumstances than doing a search from the front. However, if I expect my substring to be closer to the back than to the front, I can save considerable time by searching back first. – Him Jan 17 '19 at 21:38
  • "currently non-existing." There are things that are close-ish. For example, `str` has a method `.rindex`, which is the back-first accomplice to `.index`. However, it does not handle regular expressions. In fact, a quick google search reveals that [this](https://pypi.org/project/regex/) library has a `REVERSE` flag, which may provide just this functionality. I can't claim my own bounty, though. :) – Him Jan 17 '19 at 21:41
  • "currently non-existing" Found some related functionality [here](https://github.com/facelessuser/sublime-regex/blob/master/st3_linux_x64/regex.py) as well. – Him Jan 17 '19 at 21:46
  • I admit that I was hoping for something built-in, however. – Him Jan 17 '19 at 21:46
  • @Scott Nothing built-in, use PyPi regex module if you need that functionality badly. Or, always add `(?s).*(your_pattern_here)$` to get as quickly as possible to the end of the string, but it might require to adjust the actual pattern. – Wiktor Stribiżew Jan 18 '19 at 15:54
  • @WiktorStribizew, it looks as though my bounty isn't drumming up much excitement, so if you were to repurpose your comment as an answer I would award it. – Him Jan 20 '19 at 13:59
  • @Scott See [this answer](https://stackoverflow.com/a/54277955/3832970). – Wiktor Stribiżew Jan 20 '19 at 15:30

5 Answers5

39

you could use $ that denotes end of the line character:

>>> s = """foo bar AAAA
foo2 AAAA bar2"""
>>> re.findall(r"\w+ AAAA \w+$", s)
['foo2 AAAA bar2']

Also, note that list is a bad name for your variable, as it shadows built-in type. To access the last element of a list you could just use [-1] index:

>>> lst = [2, 3, 4]
>>> lst[-1]
4
SilentGhost
  • 307,395
  • 66
  • 306
  • 293
  • 1
    what if I came across a multiline string? – SDD May 10 '10 at 11:26
  • 3
    This won't work for an input string like this: "foo bar AAAA foo2 AAAA bar2 bar3". Of course, we don't know if such a case is possible, we don't have enough information. – tzot Jun 21 '10 at 15:31
  • 1
    Be aware that even `re.findall('abc', 'def')` will return an empty list `[]` rather than `None`. – Jia Gao Feb 25 '20 at 19:55
34

You can avoid the building of a list just by iterating over all matches and keeping the last match:

def match_last(orig_string, re_prefix, re_suffix):

    # first use positive-lookahead for the regex suffix
    re_lookahead= re.compile(f"{re_prefix}(?={re_suffix})")

    match= None
    # then keep the last match
    for match in re_lookahead.finditer(orig_string):
        pass

    if match:
        # now we return the proper match

        # first compile the proper regex…
        re_complete= re.compile(re_prefix + re_suffix)

        # …because the known start offset of the last match
        # can be supplied to re_complete.match
        return re_complete.match(orig_string, match.start())

    return match

After this, match holds either the last match or None.
This works for all combinations of pattern and searched string, as long as any possibly-overlapping regex parts are provided as re_suffix ; in this case, \w+.

>>> match_last(
    "foo bar AAAA foo2 AAAA bar2",
    r"\w+ AAAA ", r"\w+")
<re.Match object; span=(13, 27), match='foo2 AAAA bar2'>
tzot
  • 92,761
  • 29
  • 141
  • 204
  • 2
    This is very useful in a lot of usecases. You'd usually want to replace all occurrences of a pattern then custom process the last part of the input string. – Rabih Kodeih Jul 17 '13 at 12:17
4

There is no built-in re library feature that supports right-to-left string parsing, the input string is only searched for a pattern from left to right.

There is a PyPi regex module that supports this feature, however. It is regex.REVERSE flag, or its inline variation, (?r):

s="foo bar AAAA foo2 AAAA bar2"
print(regex.search(r"(?r)\w+ AAAA \w+$", s).group())
# => foo2 AAAA bar2

With re module, there is a way to quickly get to the end of string using ^[\s\S]* construct and let backtracking find the pattern that you'd like to caputure into a separate group. However, backtracking may gobble part of the match (as it will stop yielding more text once all subsequent patterns match), and in case the text is too large and there is no match, backtracking may become catastrophic. Only use this trick if your input string always matches, or if it is short and the custom pattern is not relying on backtracking much:

print(re.search(r"(?:^[\s\S]*\W)?(\w+ AAAA \w+)$", s).group(1))
# => foo2 AAAA bar2

Here, (?:^[\s\S]*\W)? matches an optional sequence of a start of string, any 0 or more chars followed with a non-word char (\W). It is necessary to add \W to make backtracking get back to the non-word char, and it must be optional as the match might start at the start of the string.

See the Python demo.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
2

I wasn't sure if your original regex would give you what you wanted. So sorry if I'm late to the party. But others may find this useful too.

import re
p = r"AAAA(?=\s\w+)" #revised per comment from @Jerry
p2 =r"\w+ AAAA \w+"
s = "foo bar AAAA foo2 AAAA bar2"
l = re.findall(p, s)
l2 = re.findall(p2, s)
print('l: {l}'.format(l=l))

#print(f'l: {l}') is nicer, but online interpreters sometimes don't support it.
# https://www.onlinegdb.com/online_python_interpreter
#I'm using Python 3.

print('l2: {l}'.format(l=l2))
for m in re.finditer(p, s):
  print(m.span())
  #A span of (n,m) would really represent characters n to m-1 with zero based index
  #So.(8,12):
  # => (8,11: 0 based index)
  # => (9th to 12th characters conventional 1 based index)
print(re.findall(p, s)[-1])

Outputs:

l: ['AAAA', 'AAAA']
l2: ['bar AAAA foo2']
(8, 12)
(18, 22)   
AAAA

The reason you get two results here instead of one in the original is the (?=) special sauce.

It's called a positive lookahead. It does not 'consume' (i.e. advance the cursor), when the match is found during the regex evaluation. So, it comes back after matching.

Although positive lookaheads are in parenthesis, they also act as a non-capture group.

So, although a pattern is matched, the results omit the surrounding sequence of alphanumeric characters represented by the \w+ and the intervening spaces, \s in my example -- representing [ \t\n\r\f\v]. (More here)

So I only get back AAAA each time.

p2 here, represents the original pattern of the code of @SDD, the person posing the question.

foo2 is consumed with that pattern, so the second AAAA would not match, as the cursor had advanced too far, when the regex engine recommences on its second iteration of matching.


I recommend taking a look at Moondra's Youtube videos if you want to dig in deeper.

He has done a very thorough 17 part series on Python regexes, beginning here


Here's a link to an online Python Interpreter.

aschultz
  • 1,658
  • 3
  • 20
  • 30
JGFMK
  • 8,425
  • 4
  • 58
  • 92
  • I don't quite understand why you have `(?=\w+\s)` in `(?=\w+\s)AAAA(?=\s\w+)`? I feel like it should actually be a positive *lookbehind*? i.e.`(?<=\w+\s)AAAA(?=\s\w+)`. Also taking into consideration the long string mention of the OP, using `re.findall()` will likely make your solution slower. – Jerry Jan 17 '19 at 05:51
  • If you tried that you'd get look-behind requires fixed-width pattern. See https://stackoverflow.com/questions/20089922/python-regex-engine-look-behind-requires-fixed-width-pattern-error – JGFMK Jan 17 '19 at 12:05
  • Exactly. Then what's the point of that lookahead? Looks like you could remove it entirely without altering the functionality – Jerry Jan 17 '19 at 12:07
  • As I indicated - "to not advance the cursor" - with a positive lookahead - not a positive lookbehind – JGFMK Jan 17 '19 at 12:09
  • 1
    Ok let me rephrase it; what is a possible scenario where `AAAA(?=\s\w+)` would match something that `(?=\w+\s)AAAA(?=\s\w+)` will avoid matching? – Jerry Jan 17 '19 at 12:11
  • Firstly. It's been 7 or 8 months since I wrote this post. At that time I had recently reviewed Moondras videos and it was fresh in my mind. TFH: I think I was just replicating original posters regex. I believe your regex would return AAAA if there was no leading space and it was at the start of the string, mine would ignore it. The original OP regex did advance the cursor, so the number of matches looked off by one. It was the not advancing cursor that was the way to solve that. It becomes apparent when you interrogate the span character ranges. – JGFMK Jan 17 '19 at 12:41
  • Well, you can blame the bounty for bringing more attention to the question :P But it's still an answer and I disagree with your statement that your regex would ignore AAAA if there was no leading space (see [here](https://regex101.com/r/OoRS8b/1)). Look, I'm not trying to attack you or your answer, I'm challenging its validity because someone might stumble on this solution and potentially use this solution which I don't believe is that good. – Jerry Jan 17 '19 at 12:55
  • Fair enough, your answer from two comments above will suffice. Revised answer to reflect that. What this addresses that the original did not is a match for each instance of AAAA (the off by one issue). That was what I was trying to add to this discussion. – JGFMK Jan 19 '19 at 12:04
  • i.e. `import re l = re.findall(r"\w+ AAAA \w+", "foo bar AAAA foo2 AAAA bar2") print(l) ` yields ['bar AAAA foo2'] and no foo2 AAAA bar2 because of the cursor advancement – JGFMK Jan 19 '19 at 12:10
0

Another fast way is using search, and group:

>>> re.search('\w+ AAAA \w+$',"foo bar AAAA foo2 AAAA bar2").group(0)
'foo2 AAAA bar2'

What it does:

  1. It uses the pattern of \w+ AAAA \w+$, which get's the last occurrence of 'AAAA' with there be-siding words alongside of them, all using \w+ (twice), and $ (once).

  2. After the process of the pattern matching, you will have to use the _sre.SRE_Match.group method to get the belonging value of the _sre.SRE_Match object, and of course get the zeroth (first) group, as know that search only retains one match (the zeroth).

Here is the regex101 of it.

Here are the timings for all of the answers (except for JGFMK's answer, since it is hard):

>>> timeit.timeit(lambda: re.findall(r"\w+ AAAA \w+$", s),number=1000000) # SilentGhost
5.783595023876842
>>> timeit.timeit('import re\nfor match in re.finditer(r"\w+ AAAA \w+", "foo bar AAAA foo2 AAAA bar2"):pass',number=1000000) # tzot
5.329235373691631
>>> timeit.timeit(lambda: re.search('\w+ AAAA \w+$',"foo bar AAAA foo2 AAAA bar2").group(0),number=1000000) # mine (U9-Forward)
5.441731174121287
>>> 

I am testing all the timings using timeit module, and also i am making number=1000000 so it takes much longer.

U13-Forward
  • 69,221
  • 14
  • 89
  • 114