Given the clarifications added to the question, several of the answers posted here do not actually provide the expected answer, and others are not efficient.
One solution is to take @dreamcrash's answer but using sets for the look-up:
def order_by1(vals, xor):
set_vals = set(vals)
set_xor = set(xor)
return [v for v in xor if v in set_vals] + [v for v in vals if v not in set_xor]
This should increase efficiency by removing the time-consuming (O(n)) list lookup on each iteration of the loop.
Since you don't care about the order of the values that are not found in xor
, a variant of this is to replace the 2nd list comprehension with a set operation:
def order_by2(vals, xor):
set_vals = set(vals)
return [v for v in xor if v in set_vals] + list(set_vals - set(xor))
Another solution is to use ordered sets
def order_by3(vals, xor):
vals_set = OrderedSet(vals)
return (OrderedSet(xor) & vals_set) | vals_set
Ordered sets use dictionaries under the hood. We can instead use dictionaries directly, making use of the fact that they are ordered:
def order_by4(vals, xor):
""" Order list a by each item's position in list xor, if it is found in xor,
otherwise by its position in list vals """
d = {k: False for k in xor}
for k in vals:
d[k] = True
return [k for k, v in d.items() if v]
The only other solution that actually does what is required is the first one in @schwobaseggl's answer. Timing these with lists of 800,000 and 300,000 random phrases of which 150,000 overlap:
import random
import string
import timeit
def random_phrase():
return "".join(random.choices(string.ascii_letters, k=10))
def generate_lists(M, N, overlap):
a = [random_phrase() for _ in range(M)]
b = ([random_phrase() for _ in range(N - overlap)] +
random.sample(a, k=overlap))
random.shuffle(b)
return a, b
def dreamcrash(vals, xor):
return [v for v in xor if v in vals] + [v for v in vals if v not in xor]
def schwobaseggl(vals, xor):
order = {n: i for i, n in enumerate(xor)}
len_xor = len(xor) # saves some time when sorting
return sorted(vals, key=lambda x: order.get(x, len_xor))
vals, xor = generate_lists(800000, 300000, overlap=150000) # this takes a while
for f in [dreamcrash, order_by1, order_by2, order_by3, order_by4, schwobaseggl]:
print(f, end='...')
print(timeit.timeit(stmt=f"{f.__name__}(vals, xor)", number=1, globals=globals()))
I gave up on timing dreamcrash
because it was taking too long. order_by2
seems fastest, followed by order_by4
, order_by1
, and schwobaseggl
, each taking around .5 - 1.5s on my computer. The ordered set solution is a lot slower. Checking whether a set contains an item, and setting an item in a dictionary, are both O(1)
in the average case, O(n)
in the worst case which explains why the dictionary and set based versions have similar performance.