Given two strings, what is the fastest way to remove all positions at which they both match a given character?

Question

This is a speed question.

I have two strings x and y of the same length. At certain positions (indices) they may have the same character (e.g., "A"). I wish to remove these characters, and squeeze together what remains.

For example, denoting non-A characters with a period ("."):

x = "....A..A....AA...A"
y = ".A.....A....AA...."

new_x, new_y = f(x, y) should return:

new_x = "....A.........A"
new_y = ".A............."

Currently, I do this with a zip/unzip and a join, as in the code snippet below. But I find this to be too slow for my needs, as I need to do this a few hundred thousand times at a time:

import timeit
setup = '''
from itertools import izip
x = 20*"....A..A....AA...A" # my actual strings are length 300-400, thus the 20*
y = 20*".A.....A....AA...."
def f(x, y, two_As=("A", "A")):
    new_x, new_y = izip(*(pair for pair in izip(x, y) if pair != two_As))
    new_x = "".join(new_x)
    new_y = "".join(new_y)
'''
>>> timeit.timeit('f(x, y)', setup=setup, number=int(1e5))
8.456903219223022

The few things I tried did not speed things up. Caching is an option (though as you might have guessed, it will not always be the same two strings!), but it seems as though this is a very simple operation and there might be a substantial speedup if done the right way.

(I note ~80% of the time is taken by the zipping/unzipping, and only 20% by the subsequent "".join() calls, but if that can be made faster, too, I'm all ears.)

I would imagine/unreasonably hope that this whole thing should take no more than 10 microseconds (~1 second for 100,000 calls).

@EricDuminil: No; the strings can contain any letters. However, it is only ever a single letter that I wish to remove. I.e., in my example above, "." can be any letter *except* "A", and I only ever want to remove "A" (but only when it appears in the same place in both strings). — user1071516, Feb 10 '17 at 20:20
Your code seems fast enough. I tried converting the strings to binary and calculating `b1 & b2` to get `000000010000110000`, it doesn't help much and is slower than your method. — Eric Duminil, Feb 10 '17 at 21:10
Thanks for the try, @EricDuminil. Good to know that binary doesn't help here. — user1071516, Feb 10 '17 at 22:21
@user2357112: thanks for the pointer. I am looking into string manipulations in Cython now, but if you have some specific advice I'd appreciate it. I was under the impression that at this point everything fast that could be wrapped in C was already in that state in Python (2.7)? — user1071516, Feb 10 '17 at 22:21
@user1071516: *Your* code isn't in C, particularly the `(pair for pair in izip(x, y) if pair != two_As)`. That goes through a ton of redundant type checking, object creation, and dynamic allocation that a C implementation could bypass. — user2357112, Feb 10 '17 at 22:27
if you are willing to make your strings a different format, pandas/numpy can speed this up: `z = pd.DataFrame({'x':list(x), 'y':list(y)}); z['x1'] = np.where(z['x'] == z['y'], '.',z['x']); z['y1'] = np.where(z['x'] == z['y'], '.',z['y']);` But the speed is only apparent in larger strings than yours, thanks to the overhead of creating the initial dataframe. — jeremycg, Feb 10 '17 at 22:54
@jeremycg: yes, I think the overhead kills any speed advantage. I'm trying to do this in numpy now, but meanwhile, if anyone knows what the C equivalent would look like, I'll try implementing in Cython. I just happen not to know C well enough. — user1071516, Feb 10 '17 at 23:19

Given two strings, what is the fastest way to remove all positions at which they both match a given character?

0 Answers0