2

I found a nice question showing that you can test a string against multiple endings at once using endswith(tuple):

Check if string ends with one of the strings from a list

My question is: how can I find out which value from the tuple actually matched? And if there are multiple matches, how can I choose the best one?

For example:

str = "ERTYHGFYUUHGFREDFYAAAAAAAAAA"
endings = ('AAAAA', 'AAAAAA', 'AAAAAAA', 'AAAAAAAA', 'AAAAAAAAA')
str.endswith(endings)  # returns True (every value in the tuple matches), but how can I tell which one matches best?

In this case, several endings from the tuple match. How can I deal with this and return only the best (longest) match, which here should be 'AAAAAAAAA'? I then want to remove it from the end of the string (which can be done with a regular expression or similar).

One could do this with a for loop, but maybe there is a simpler, more Pythonic way?

ifreak
  • 3
    Have you considered a regular expression instead? `A{5,9}$` would match all those endings too and a match object will tell you what matched. – Martijn Pieters Sep 02 '15 at 09:53
  • 3
    If it's always the same letter and you want to remove it, why not `.rstrip('A')`? Note that `str` is a bad name for a string, as it shadows the built-in. – jonrsharpe Sep 02 '15 at 09:55
  • 3
    `endswith(tuple)` is trivially implemented as *loop over tuple, run individual match on string*, so it's really only for convenience (it's not faster). – dhke Sep 02 '15 at 09:59

4 Answers

2
>>> s = "ERTYHGFYUUHGFREDFYAAAAAAAAAA"
>>> endings = ['AAAAA', 'AAAAAA', 'AAAAAAA', 'AAAAAAAA', 'AAAAAAAAA']
>>> max([i for i in endings if s.endswith(i)],key=len)
'AAAAAAAAA'
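One caveat with the one-liner above: `max()` raises `ValueError` when none of the endings match. A minimal sketch of a safer variant follows, assuming Python 3.4+ for the `default=` argument; the helper name `longest_ending` is hypothetical and not part of the original answer. It also shows how the matched ending could be sliced off, which is what the question ultimately asks for.

```python
def longest_ending(s, endings):
    """Return the longest item of `endings` that `s` ends with, or None.

    Hypothetical helper for illustration; not from the original answer.
    """
    # default=None avoids the ValueError that max() raises on an empty sequence.
    return max((e for e in endings if s.endswith(e)), key=len, default=None)


s = "ERTYHGFYUUHGFREDFYAAAAAAAAAA"
endings = ('AAAAA', 'AAAAAA', 'AAAAAAA', 'AAAAAAAA', 'AAAAAAAAA')

best = longest_ending(s, endings)
# Slice the matched ending off; leave the string untouched if nothing matched.
trimmed = s[:-len(best)] if best else s
print(best)  # AAAAAAAAA
```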
Jonas Byström
1
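import re
s = "ERTYHGFYUUHGFREDFYAAAAAAAAAA"
endings = ['AAAAA', 'AAAAAA', 'AAAAAAA', 'AAAAAAAA', 'AAAAAAAAA']

# Keep the endings that match at the end of s (escaped, so they are taken literally), then pick the longest.
print(max([i for i in endings if re.search(re.escape(i) + r"$", s)], key=len))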
vks
1

How about:

len(str) - len(str.rstrip('A'))
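The expression above counts how many trailing 'A' characters the string has. A rough sketch of how that count could be used to remove the run, optionally capped at a maximum length (such as the 9 characters of the longest ending in the question), might look like this. The function name `strip_trailing` and its `max_count` parameter are assumptions made for illustration, and `ch` is assumed to be a single character.

```python
def strip_trailing(s, ch, max_count=None):
    """Remove up to `max_count` trailing copies of `ch` from `s`.

    Returns (new_string, number_removed). Hypothetical helper; `ch` is
    assumed to be a single character.
    """
    # Length of the run of `ch` at the end of the string.
    run = len(s) - len(s.rstrip(ch))
    removed = run if max_count is None else min(run, max_count)
    return s[:len(s) - removed], removed


print(strip_trailing("xyAAA", "A"))     # ('xy', 3)
print(strip_trailing("xyAAA", "A", 2))  # ('xyA', 2)
```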
sureshvv
-1

str.endswith(tuple) is (currently) implemented as a simple loop over the tuple that re-runs the match for each entry; similarities between the endings are not taken into account.

In the example case, a regular expression should compile into an automaton that essentially runs in linear time:

import re

# Longest endings first, because re takes the first alternative that matches.
regexp = '(' + '|'.join(
   re.escape(ending) for ending in sorted(endings, key=len, reverse=True)
) + ')$'

Edit 1: As pointed out correctly by Martijn Pieters, Python's re does not return the longest overall match, but for alternates only matches the first matching subexpression:

https://docs.python.org/2/library/re.html#module-re:

When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match.

(emphasis mine)

Hence, unfortunately the need for sorting by length.

Note that this makes Python's re different from POSIX regular expressions, which match the longest overall match.
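To make the first-match behaviour concrete, here is a small sketch. The first two calls reproduce the unanchored alternation example from the comments; the helper `longest_ending_regex` is a hypothetical wrapper around the longest-first pattern described above, not code from the original answer.

```python
import re

# Alternation takes the first branch that matches, not the longest one.
print(re.search('AA|AAAAA', 'AAAAA').group(0))  # AA
print(re.search('AAAAA|AA', 'AAAAA').group(0))  # AAAAA


def longest_ending_regex(s, endings):
    """Return the ending of `s` matched by a longest-first alternation, or None.

    Hypothetical helper for illustration.
    """
    # Sort longest-first so the longest alternative is tried before shorter ones.
    ordered = sorted(endings, key=len, reverse=True)
    pattern = re.compile('(?:' + '|'.join(re.escape(e) for e in ordered) + ')$')
    m = pattern.search(s)
    return m.group(0) if m else None


s = "ERTYHGFYUUHGFREDFYAAAAAAAAAA"
endings = ['AAAAA', 'AAAAAA', 'AAAAAAA', 'AAAAAAAA', 'AAAAAAAAA']
print(longest_ending_regex(s, endings))  # AAAAAAAAA
```

The match object also gives the position via `m.start()`, so the ending could be removed with a slice instead of a second pass.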

dhke
  • No, a `|` delimited set of options matches the *first*, not the longest. Greediness applies to multipliers, not to `|` groups. You would have to sort the endings in reverse order of length. – Martijn Pieters Sep 02 '15 at 10:09
  • `>>> re.search('AA|AAAAA', 'AAAAA').group(0)` returns `AA`, not `AAAAA`. – Martijn Pieters Sep 02 '15 at 10:11
  • @MartijnPieters Can you point me to where this is specified? Because it makes Python's `re` different from POSIX. I just checked again and `regexec()` clearly finds the longest overall match as specified. – dhke Sep 02 '15 at 10:31
  • Okay, found it. `re` [documents it](https://docs.python.org/2/library/re.html#module-re). As noted, I consider this quite the gotcha, because it's different from the rest of the world ... – dhke Sep 02 '15 at 10:43
  • It is not different. You are confusing the quantifiers (`?`, `*`, `+`, `{m,n}`), which do have greedy and non-greedy behaviour, with alternates (the `|` symbol). Regex is *eager*, not greedy, there, so first match satisfies. Different engines can make different choices here, Python's is regex-directed. See http://www.regular-expressions.info/alternation.html – Martijn Pieters Sep 02 '15 at 11:06
  • @MartijnPieters I do not think so. In your own reference, the last heading actually states the POSIX requirements, which *requires the longest match be returned* and then complains that this is *inefficient* (which is true). Are you sure you are not generalizing from Python to other engines? Because, as noted, `regexec()` *behaves differently* (and it should). – dhke Sep 02 '15 at 11:18
  • Sorry, I did not mean to generalise to other engines. There are two choices here, Python picked one. But there are other engines that made the same choice. – Martijn Pieters Sep 02 '15 at 11:29
  • @MartijnPieters I feel obliged to re-state that you are totally right with respect to Python's re: it does indeed do left-to-right first-match and is specified to do so. I just did not know that and by default assumed POSIX. So thank you again for pointing that out. – dhke Sep 02 '15 at 11:31
  • There is a new regex engine in development for Python core, see https://pypi.python.org/pypi/regex, but it did not switch behaviour here (I tested). Perhaps a feature request for the engine? – Martijn Pieters Sep 02 '15 at 11:35
  • @MartijnPieters Yep, I just decided to suggest this as a low-priority proposal. I'm not too angry if it doesn't happen though, as it looks like short-circuit matching has become the de facto standard (probably because of PCRE). – dhke Sep 02 '15 at 12:23