This is a speed question.
I have two strings, x and y, of the same length. At certain positions (indices), both strings contain the same specific character ("A" in this case). I wish to remove those positions from both strings and squeeze together what remains.
For example, denoting non-A characters with a period ("."):
x = "....A..A....AA...A"
y = ".A.....A....AA...."
new_x, new_y = f(x, y) should return:
new_x = "....A.........A"
new_y = ".A............."
Currently, I do this with a zip/unzip and a join, as in the code snippet below. But I find this to be too slow for my needs, since I need to do it a few hundred thousand times in a row:
import timeit
setup = '''
from itertools import izip
x = 20*"....A..A....AA...A" # my actual strings are length 300-400, thus the 20*
y = 20*".A.....A....AA...."
def f(x, y, two_As=("A", "A")):
    # keep only the positions where the pair is not ("A", "A")
    new_x, new_y = izip(*(pair for pair in izip(x, y) if pair != two_As))
    new_x = "".join(new_x)
    new_y = "".join(new_y)
    return new_x, new_y
'''
>>> timeit.timeit('f(x, y)', setup=setup, number=int(1e5))
8.456903219223022
That works out to roughly 85 microseconds per call. The few things I tried did not speed things up. Caching is an option (though, as you might have guessed, it will not always be the same two strings!), but this seems like such a simple operation that there might be a substantial speedup if it is done the right way.
(I note ~80% of the time is taken by the zipping/unzipping, and only 20% by the subsequent "".join() calls, but if that can be made faster, too, I'm all ears.)
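For reference, a rough way to reproduce that split is to time the two stages separately; a sketch along these lines (the join_setup name and the exact statements are just for illustration, reusing the setup string from above):

# Time just the zip/filter/unzip stage:
timeit.timeit('nx, ny = izip(*(pair for pair in izip(x, y) if pair != ("A", "A")))',
              setup=setup, number=int(1e5))

# Time just the two "".join() calls, with the unzipped tuples precomputed in the setup:
join_setup = setup + '\nxs, ys = izip(*(pair for pair in izip(x, y) if pair != ("A", "A")))\n'
timeit.timeit('"".join(xs); "".join(ys)', setup=join_setup, number=int(1e5))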
I would imagine (or perhaps unreasonably hope) that this whole thing should take no more than about 10 microseconds per call (~1 second for 100,000 calls).