Finding whether a string starts with one of a list's variable-length prefixes

Question

I need to find out whether a name starts with any of the prefixes in a list and, if so, remove that prefix, like:

if name[:2] in ["i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_"]:
    name = name[2:]

The above only works when every prefix in the list has a length of two. I need the same functionality for variable-length prefixes.

How is it done efficiently (little code and good performance)?

A for loop iterating over each prefix and then checking name.startswith(prefix) to finally slice the name according to the length of the prefix works, but it's a lot of code, probably inefficient, and "non-Pythonic".

Does anybody have a nice solution?

It isn't a lot of code to do, just a lot of code to make clear. — Ignacio Vazquez-Abrams, Sep 24 '11 at 15:33
@brc the issue was that the prefixes could be multiple characters, so it wouldn't be sufficient to check `name[:2]` — Foo Bah, Sep 24 '11 at 15:37
`A for loop iterating over each prefix and then checking name.startswith(prefix) to finally slice the name according to the length of the prefix works` That sounds pretty pythonic to me. That shouldn't be more than 5 or 10 lines of code. "Pythonic" doesn't mean it has to be done in 1 line. — Falmarri, Sep 24 '11 at 17:11
I know this is a really old question but what would you want to have happen if the name starts with multiple prefixes in the list, where each of the prefixes were different lengths? ex. name = "amazing", list = ['am', 'ama', 'amaz']. Should it remove 2, 3, or 4 characters? — KrisF, Sep 29 '14 at 02:39

dm03514 · Answer 1 · 2018-07-03T12:02:28.853

49

str.startswith(prefix[, start[, end]])

Return True if string starts with the prefix, otherwise return False. prefix can also be a tuple of prefixes to look for. With optional start, test string beginning at that position. With optional end, stop comparing string at that position.

$ ipython
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: prefixes = ("i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_")

In [2]: 'test'.startswith(prefixes)
Out[2]: False

In [3]: 'i_'.startswith(prefixes)
Out[3]: True

In [4]: 'd_a'.startswith(prefixes)
Out[4]: True
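
Since the asker also wants the matched prefix removed, here is a minimal sketch that builds on the tuple form. `strip_prefix` is a hypothetical helper name, not something from the standard library. `startswith` only reports that some prefix matched, not which one, so the sketch assumes a second pass to find it.

```python
def strip_prefix(name, prefixes):
    """Hypothetical helper: drop the first prefix in `prefixes` that `name` starts with."""
    prefixes = tuple(prefixes)
    # One call checks every prefix; names without a prefix return unchanged.
    if not name.startswith(prefixes):
        return name
    # startswith() does not report which prefix matched, so look it up.
    matched = next(p for p in prefixes if name.startswith(p))
    return name[len(matched):]


prefixes = ("i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_")
print(strip_prefix("d_a", prefixes))   # a
print(strip_prefix("test", prefixes))  # test
```

The tuple check is cheap for names without a prefix, which may be the common case; the generator is only consumed when a match is known to exist.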

edited Jul 03 '18 at 12:02

answered Sep 24 '11 at 16:05

dm03514

I also need to remove the found prefix from the name in case it starts with one of the prefixes. Maybe the question was a little inaccurate, however I still like the fact that `str.startswith` also accepts a tuple. (unchecked) – Kawu Sep 24 '11 at 16:18
yes, because it accepts tuples it might be the cleanest implementation. – dm03514 Sep 24 '11 at 23:00

score 15 · Accepted Answer · answered Sep 24 '11 at 16:01

A bit hard to read, but this works:

name=name[len(filter(name.startswith,prefixes+[''])[0]):]
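
The one-liner above relies on Python 2, where `filter` returns a list. On Python 3 `filter` returns a lazy iterator, so indexing it with `[0]` raises a TypeError. A possible Python 3 sketch of the same idea uses `next()`; the function name `strip_first_prefix` is made up for illustration, and `prefixes` is assumed to be any iterable of strings.

```python
def strip_first_prefix(name, prefixes):
    """Hypothetical Python 3 version of the filter-based one-liner."""
    # The trailing "" always matches, so next() never raises StopIteration
    # and unprefixed names come back unchanged.
    matched = next(filter(name.startswith, list(prefixes) + [""]))
    return name[len(matched):]


prefixes = ["i_", "c_", "also_longer_"]
print(strip_first_prefix("also_longer_name", prefixes))  # name
print(strip_first_prefix("plain", prefixes))             # plain
```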

Vaughn Cato

Very nice, this even ignores unprefixed names. Perfect. – Kawu Sep 26 '11 at 12:07
For those more used to list comprehensions, this is equivalent to: `name=name[len([prefix for prefix in prefixes+[''] if name.startswith(prefix)][0]):]` – Filipe Correia Sep 11 '12 at 11:40

unutbu · Answer 3 · 2011-09-24T16:00:56.407

5

for prefix in prefixes:
    if name.startswith(prefix):
        name=name[len(prefix):]
        break
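
If prefixes can overlap (for example 'am', 'ama' and 'amaz' for the name "amazing", as raised in a comment on the question), the loop removes whichever matching prefix comes first in the list. One possible policy, assumed here because the asker never specified one, is to prefer the longest match. `strip_longest_prefix` is a hypothetical name for this sketch.

```python
def strip_longest_prefix(name, prefixes):
    """Hypothetical helper: remove the longest prefix that `name` starts with."""
    # Trying longer prefixes first means "amaz" wins over "am".
    for prefix in sorted(prefixes, key=len, reverse=True):
        if name.startswith(prefix):
            return name[len(prefix):]
    return name


print(strip_longest_prefix("amazing", ["am", "ama", "amaz"]))  # ing
print(strip_longest_prefix("other", ["am", "ama", "amaz"]))    # other
```

Sorting on every call is wasteful for many names; sorting the prefix list once up front would avoid that.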

edited Sep 24 '11 at 16:00

answered Sep 24 '11 at 15:41

unutbu

Except genexes don't leak the iterator name. – Ignacio Vazquez-Abrams Sep 24 '11 at 15:45
@unutbu: The list is about 10 prefixes long. Thanks – Kawu Sep 24 '11 at 15:55
The first solution won't work, since only the *last* value of the iterator name is leaked. – Ignacio Vazquez-Abrams Sep 24 '11 at 15:59

score 3 · Answer 4 · answered Sep 24 '11 at 16:49

Regexes will likely give you the best speed:

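import re

prefixes = ["i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_", "also_longer_"]
# Escape each prefix so regex metacharacters are matched literally.
re_prefixes = "|".join(re.escape(p) for p in prefixes)

# re.match anchors at the start of the string, so m.start() is always 0.
m = re.match(re_prefixes, my_string)
if m:
    my_string = my_string[m.end():]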
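
When many strings are processed, the pattern can be compiled once and reused. The following is only a sketch under that assumption; `PREFIX_RE` and `strip_prefix` are names invented here, and no timing claim is made for it.

```python
import re

prefixes = ["i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_", "also_longer_"]

# Compile once; each alternative is escaped so it is matched literally.
PREFIX_RE = re.compile("|".join(re.escape(p) for p in prefixes))


def strip_prefix(s):
    """Hypothetical helper: remove a leading prefix if the compiled pattern matches."""
    m = PREFIX_RE.match(s)  # match() only succeeds at the start of the string
    return s[m.end():] if m else s


print([strip_prefix(s) for s in ["i_count", "also_longer_name", "plain"]])
# ['count', 'name', 'plain']
```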

Ned Batchelder

@JohnMachin Couldn't he just have done `re_prefixes = '^' + "|^".join(re.escape(p) for p in prefixes)`? Thanks. – tommy.carstensen Sep 25 '19 at 04:32

score 2 · Answer 5 · answered Sep 24 '11 at 15:34

If you define prefix to be the characters before an underscore, then you can check for

if name.partition("_")[0] in ["i", "c", "m", "l", "d", "t", "e", "b", "foo"] and name.partition("_")[1] == "_":
    name = name.partition("_")[2]
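
The snippet above partitions the string up to three times. A sketch that partitions once and checks the separator directly could look like this; `KNOWN_PREFIXES` and `strip_known_prefix` are hypothetical names, and the set of prefixes is copied from the snippet.

```python
KNOWN_PREFIXES = {"i", "c", "m", "l", "d", "t", "e", "b", "foo"}


def strip_known_prefix(name):
    """Hypothetical helper: drop 'prefix_' when the part before the first '_' is known."""
    head, sep, tail = name.partition("_")
    # sep is "" when the name contains no underscore at all.
    if sep and head in KNOWN_PREFIXES:
        return tail
    return name


print(strip_known_prefix("foo_bar"))  # bar
print(strip_known_prefix("x_bar"))    # x_bar
print(strip_known_prefix("foo"))      # foo
```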

Foo Bah

I'd use `"_" in name` as your second clause to avoid partitioning the string twice, and in fact I'd put that clause first to avoid partitioning the string at all if there's no underscore in it. But good thinking. – kindall Sep 24 '11 at 16:07

etuardu · Answer 6 · 2011-09-24T17:24:18.147

What about using filter?

prefs = ["i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_"]
name = list(filter(lambda item: not any(item.startswith(prefix) for prefix in prefs), name))

Note that the comparison of each list item against the prefixes halts on the first match. This behaviour is guaranteed by the any function, which returns as soon as it finds a True value, e.g.:

def gen():
    print("yielding False")
    yield False
    print("yielding True")
    yield True
    print("yielding False again")
    yield False

>>> any(gen()) # last two lines of gen() are not performed
yielding False
yielding True
True

Or, using re.match instead of startswith:

import re
patt = '|'.join(["i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_"])
name = list(filter(lambda item: not re.match(patt, item), name))

score 2 · Answer 7 · answered Sep 24 '11 at 22:40

Regex, tested:

import re

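def make_multi_prefix_matcher(prefixes):
    # Escape each prefix so characters such as '|' and '.' are matched literally.
    regex_text = "|".join(re.escape(p) for p in prefixes)
    print(repr(regex_text))
    # Return the bound match method; it only matches at the start of a string.
    return re.compile(regex_text).match

# "foobar" is listed before "foo" because alternatives are tried left to right.
pfxs = "x ya foobar foo a|b z.".split()
names = "xenon yadda yeti food foob foobarre foo a|b a b z.yx zebra".split()

matcher = make_multi_prefix_matcher(pfxs)
for name in names:
    m = matcher(name)
    if not m:
        print(repr(name), "no match")
        continue
    n = m.end()
    print(repr(name), n, repr(name[n:]))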

Output:

'x|ya|foobar|foo|a\\|b|z\\.'
'xenon' 1 'enon'
'yadda' 2 'dda'
'yeti' no match
'food' 3 'd'
'foob' 3 'b'
'foobarre' 6 're'
'foo' 3 ''
'a|b' 3 ''
'a' no match
'b' no match
'z.yx' 2 'yx'
'zebra' no match

Nice complete solution and I appreciate the escaping and testing! I'm sure this regex based approach would run faster than list comprehensions etc for any sizeable amount of data, with a fairly long list of prefixes. — RichVel, Apr 08 '13 at 16:49

score 1 · Answer 8 · answered Sep 24 '11 at 15:56

When it comes to search and efficiency, always think of indexing techniques to improve your algorithms. If you have a long list of prefixes, you can use an in-memory index by simply indexing the prefixes by their first character in a dict.

This solution is only worthwhile if you have a long list of prefixes and performance becomes an issue.

pref = ["i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_"]

# Index the prefixes by their first character. Do this only once.
d = dict()
for x in pref:
    if x[0] not in d:
        d[x[0]] = list()
    d[x[0]].append(x)

name = "c_abcdf"

# Look up in d so that only prefixes sharing the name's first character are checked.
# name[:1] is used instead of name[0] so that an empty name does not raise IndexError.
result = list(filter(lambda x: name.startswith(x), d.get(name[:1], [])))
print(result)

score 0 · Answer 9 · answered Dec 04 '18 at 14:07

Could use a simple regex.

import re
prefixes = ("i_", "c_", "longer_")
# re.escape keeps prefixes containing regex metacharacters literal.
re.sub(r'^(%s)' % '|'.join(map(re.escape, prefixes)), '', name)

Or if anything preceding an underscore is a valid prefix:

name.split('_', 1)[-1]

This removes any number of characters before the first underscore.
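
A short illustration of how the split-based variant behaves, using made-up sample names. Note that it also strips "prefixes" that are not in any list, and that names without an underscore are left unchanged.

```python
names = ["i_name", "longer_name", "a_b_c", "plain"]

# maxsplit=1 splits only at the first underscore; [-1] takes the remainder,
# or the whole string when there is no underscore.
stripped = [n.split("_", 1)[-1] for n in names]
print(stripped)  # ['name', 'name', 'b_c', 'plain']
```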

Mark Tolonen · Answer 10 · 2011-09-24T17:44:41.567

This edits the list on the fly, removing prefixes. The break skips the rest of the prefixes once one is found for a particular item.

items = ['this', 'that', 'i_blah', 'joe_cool', 'what_this']
prefixes = ['i_', 'c_', 'a_', 'joe_', 'mark_']

for i,item in enumerate(items):
    for p in prefixes:
        if item.startswith(p):
            items[i] = item[len(p):]
            break

print(items)

Output

['this', 'that', 'blah', 'cool', 'what_this']

score -1 · Answer 11 · answered Sep 25 '11 at 01:50

import re

def make_multi_prefix_replacer(prefixes):
    if isinstance(prefixes, str):
        prefixes = prefixes.split()
    # Try longer prefixes first so that 'foobar' wins over 'foo'.
    # sorted() is used so the caller's list is not modified.
    prefixes = sorted(prefixes, key=len, reverse=True)
    # \b matches at any word boundary, not only at the start of the string.
    pat = r'\b(%s)' % "|".join(map(re.escape, prefixes))
    print('regex pattern :', repr(pat), '\n')
    def suber(x, reg=re.compile(pat)):
        return reg.sub('', x)
    return suber


pfxs = "x ya foobar yaku foo a|b z."
replacer = make_multi_prefix_replacer(pfxs)

names = "xenon yadda yeti yakute food foob foobarre foo a|b a b z.yx zebra".split()
for name in names:
    print(repr(name), repr(replacer(name)), '', sep='\n')

ss = 'the yakute xenon is a|bcdf in the barfoobaratu foobarii'
print('', repr(ss), repr(replacer(ss)), '', sep='\n')
