Python: Find first non-matching character

Question

Under Python, when you want to obtain the index of the first occurrence of a substring or character within a list, you use something like this:

s.find("f")

However, I'd like to find the index of the first character within the string that does not match. Currently, I'm using the following:

iNum = 0
for i, c in enumerate(line):
  if(c != mark):
    iNum = i
    break

Is there a more efficient way to do this, such as a built-in function I don't know about?

Andrew Clark · Accepted Answer · 2013-10-04T21:52:56.220

You can use regular expressions, for example:

>>> import re
>>> re.search(r'[^f]', 'ffffooooooooo').start()
4

[^f] will match any character except for f, and the start() method of a Match object (returned by re.search()) will give the index that the match occurred.

To make sure you can also handle empty strings or strings that only contain f you would want to check to make sure the result of re.search() is not None, which will happen if the regex cannot be matched. For example:

first_index = -1
match = re.search(r'[^f]', line)
if match:
    first_index = match.start()

If you prefer not to use regex, you won't do any better than your current method. You could use something like next(i for i, c in enumerate(line) if c != mark), but you would need to wrap this with a try and except StopIteration block to handle empty lines or lines that consist of only mark characters.

I had wanted to avoid using regex because it can sometimes have inferior performance compared to doing the check manually. Using a regex had slightly slower performance than the method I'm already using (1 second longer over a loop of 10,000,000 iterations), while using the next() method took 4-5 seconds longer over the same number of iterations. So, yes, you are correct: I won't do any better than my current method. — Zauberin Stardreamer, Oct 04 '13 at 22:19

score 2 · Answer 2 · answered Jun 10 '15 at 05:07

I had this same problem and looked into timing the solutions here (except the map/list-comp ones from @wwii which are significantly slower than any other options). I also added in a Cython version of the original version.

I made and tested these all in Python v2.7. I was using byte-strings (instead of Unicode strings). I am unsure if the regular-expression methods need something different to work with byte-strings in Python v3. The 'mark' is hard-coded to being the null byte. This could be easily changed.

All methods return -1 if the entire byte-string is the null-byte. All of these were tested in IPython (lines starting with % are special).

import re

def f1(s): # original version
    for i, c in enumerate(s):
        if c != b'\0': return i
    return -1

def f2(s): # @ChristopherMahan's version
    i = 0
    for c in s:
        if c != b'\0': return i
        i += 1
    return -1

def f3(s): # @AndrewClark's alternate version
    # modified to use optional default argument instead of catching StopIteration
    return next((i for i, c in enumerate(s) if c != b'\0'), -1)

def f4(s): # @AndrewClark's version
    match = re.search(br'[^\0]', s)
    return match.start() if match else -1

_re = re.compile(br'[^\0]')
def f5(s): # @AndrewClark's version w/ precompiled regular expression
    match = _re.search(s)
    return match.start() if match else -1

%load_ext cythonmagic
%%cython
# original version optimized in Cython
import cython
@cython.boundscheck(False)
@cython.wraparound(False)
def f6(bytes s):
    cdef Py_ssize_t i
    for i in xrange(len(s)):
        if s[i] != b'\0': return i
    return -1

The timing results:

s = (b'\x00' * 32) + (b'\x01' * 32) # test string

In [11]: %timeit f1(s) # original version
100000 loops, best of 3: 2.48 µs per loop

In [12]: %timeit f2(s) # @ChristopherMahan's version
100000 loops, best of 3: 2.35 µs per loop

In [13]: %timeit f3(s) # @AndrewClark's alternate version
100000 loops, best of 3: 3.07 µs per loop

In [14]: %timeit f4(s) # @AndrewClark's version
1000000 loops, best of 3: 1.91 µs per loop

In [15]: %timeit f5(s) # @AndrewClark's version w/ precompiled regular expression
1000000 loops, best of 3: 845 ns per loop

In [16]: %timeit f6(s) # original version optimized in Cython
1000000 loops, best of 3: 305 ns per loop

Overall, @ChristopherMahan's version is slightly faster than the original (apparently enumerate is slower than using your own counter). Using the next (@AndrewClark's alternate version) method is slower than the original even though it is essentially the same thing in a one-line form.

Using regular-expresions (@AndrewClark's version) are significantly faster than a loop, especially if you pre-compile the regex!

Then, if you can use Cython, it is by far the fastest. The OP's concern that using a regex is slow is validated, but a loop in Python is even slower. The loop in Cython is quite fast.

When testing with 3200 0s followed by a non-0, the timings are similar with `f1`, `f2`, and `f3` being around 200us (with `f3` the fastest and `f2` the slowest). `f4` and `f5` are around 26us, with pre-compiled being only slightly faster. The `f6` version is still the fastest, but its margin is not as large over the regular-expression versions. — coderforlife, Jun 10 '15 at 05:11
For more info on enumerate vs using your own counter, https://stackoverflow.com/questions/4852944/what-is-faster-for-loop-using-enumerate-or-for-loop-using-xrange-in-python ... it sounds like it has changed in v3. — JeopardyTempest, Feb 01 '22 at 17:39

score 1 · Answer 3 · answered Oct 04 '13 at 21:55

1

As python and as simple as possible. replace print(counter) with print counter for python 2.x

s = "ffffff5tgbh44frff"
counter = 0
for c in s:
    counter = counter + 1
    if c != "f":
        break

print (counter)

answered Oct 04 '13 at 21:55

Christopher Mahan

7,621
9
53
66

3

Actually, you can use print with parenthesis in both python2 and python3. And I'm there's much difference from what you're doing compared to what I'm already doing, aside from not using enumerate() – Zauberin Stardreamer Oct 04 '13 at 21:59
true, but someone coming across your code would have to know what enumerate does, or guess at it, which is simple enough. I don't know that there's an easier way. Is the performance bad, or do you have very long strings? – Christopher Mahan Oct 04 '13 at 22:06
Mostly I'm being OCD about micro-optimization. Your method provides slightly better performance, though it's only a gain of about 700 milliseconds over a loop of 10,000,000 iterations. – Zauberin Stardreamer Oct 04 '13 at 22:15
Ah, glad that it wasn't slower! Also, how was the regular expression answer, performance-wise? – Christopher Mahan Oct 04 '13 at 22:18
1 second slower over the same amount of iterations, and the next() method was 4-5 seconds slower. – Zauberin Stardreamer Oct 04 '13 at 22:20
1

Well, it seems that performance depends upon whether or not a comparison is done between a variable or a string literal. When I had profiled your example before, I had done the comparison as (c != mark) (since the mark would vary). However, when I change it to (c != "*"), then it runs about 2 seconds faster than my method. Although the mark will vary, this is still an interesting micro-optimization I can keep in mind. – Zauberin Stardreamer Oct 04 '13 at 22:27

score 0 · Answer 4 · edited Oct 04 '13 at 23:18

0

Now i am curious how these two fare.

>>> # map with a partial function
>>> import functools
>>> import operator
>>> f = functools.partial(operator.eq, 'f')
>>> map(f, 'fffffooooo').index(False)
5
>>> # list comprehension
>>> [c == 'f' for c in 'ffffoooo'].index(False)
4
>>>

edited Oct 04 '13 at 23:18

Christopher Mahan

7,621
9
53
66

answered Oct 04 '13 at 23:01

wwii

23,232
7
37
77

Alright, running those through my profiler, I get, with each measurement being the best of three runs, with each run being 10,000,000 iterations: - Original: 11.49 seconds - Using Map: 21.81 seconds - Using Comp: 18.25 seconds Both methods used a lot more time. I think it's because they're reading the entire string, where as the original method exits the loop once it finds a mismatched character. – Zauberin Stardreamer Oct 05 '13 at 02:26
They go through the string once to make the True/False list then have to find the first False - so more than once. I was thinking about it later in the car and realized that they would be slower - was wondering if I'd get a vote down for not thinking it through before posting. – wwii Oct 05 '13 at 04:33

score 0 · Answer 5 · answered Mar 09 '18 at 16:22

0

Here is a oneliner:

> print([a == b for (a_i, a) in enumerate("compare_me") for
(b_i, b) in enumerate("compar me") if a_i == b_i].index(False))
> 6
> "compare_me"[6]
> 'e'

answered Mar 09 '18 at 16:22

throws_exceptions_at_you

1,626
14
17

Python: Find first non-matching character

5 Answers5

Linked