How to move list items to the front if they are in another list while keeping the order

Question

I have a two example lists,

vals = ["a", "c", "d", "e", "f", "g"]
xor  = ["c", "g"]

I want to sort vals list based on the xor list, i.e., the values in xor should be placed first in vals list in exact order. The rest of the values present in vals should stay in the same order.

Also, it is possible that values in xor might not be in vals in those cases just ignore those values. And also, in the case of duplicates, I only need one value.

Desired output:

vals = ["c", "g", "a", "d", "e", "f"]
# here a, d, e, f are not in xor so we keep them in same order as found in vals.

My Approach:

new_list = []
for x in vals:
    for y in xor:
        if x == y:
            new_list.append(x)
            
for x in vals:
    if x not in xor:
        new_list.append(x)

The vals list currently has around 800k words or phrases. The xor list has 300k words or phrases but it might increase later on. Some phrases are a bit long as well. What would be the most efficient way to address the problem?

user2390182 · Accepted Answer · 2020-12-20T07:44:59.030

5

Build a order dict of indeces in xor and use it as a sorting key:

order = {n: i for i, n in enumerate(xor)}

sorted(vals, key=lambda x: order.get(x, len(xor)))
# ['c', 'g', 'a', 'd', 'e', 'f']

Using len(vals) as a default value ensures all values not in xor will end up at the back. This of course assumes, that you want the values that are in xor to be sorted based on their order in xor (making the process O(M+NlogN)). Otherwise you can be faster (O(M+N)) with:

from operator import contains
from functools import partial
s = set(xor)
result = list(filter(partial(contains, s), vals))
result.extend(v for v in vals if v not in s)

Or in a more readable fashion:

s = set(xor)
result = [v for v in vals if v in s]
result += (v for v in vals if v not in s)

edited Dec 20 '20 at 07:44

answered Dec 19 '20 at 21:06

user2390182

72,016
6
67
89

What is the time complexity of this approach? I have a huge list. – user_12 Dec 19 '20 at 21:14
`O(M+NlogN)` with `M` the size of `xor` and `N` the size of `vals`. – user2390182 Dec 19 '20 at 21:15
I don't want to sort the values in` vals`, other values in vals that are not found in xor should stay in the same order – user_12 Dec 19 '20 at 21:26
Don't use dunder methods like this – juanpa.arrivillaga Dec 19 '20 at 21:28
If the second approach is O (M+N), this seems to be more efficient. – user_12 Dec 19 '20 at 21:38
@juanpa.arrivillaga I wouldn't usually, but given the OP's performance concerns, it does save the repeated method lookup. – user2390182 Dec 19 '20 at 21:38
@user_12 Yes, but the values won't be sorted by order within `xor`, as stated. If you don't need that, go for the second variant. – user2390182 Dec 19 '20 at 21:40
1

@schwobaseggl then something like `partial(operator.contains, a))` would be better. I took the liberty of editing – juanpa.arrivillaga Dec 19 '20 at 21:43
That's a valid point and appreciated. I am not sure though if that kills the advantage as `partial(operator.contains, s)` might do a plain `in` check under the hood. – user2390182 Dec 19 '20 at 21:48
@juanpa.arrivillaga Does using partial and contains in any way affect the time? – user_12 Dec 19 '20 at 21:49
@user_12 time it. Actually `.__contains__` might be faster. But it's a bad practice to use dunder methods directly. – juanpa.arrivillaga Dec 19 '20 at 21:50
@user_12 For most intents and purposes, we are just nitpicking here. The asymptotic complexity remains unchanged and the differences will be minute. – user2390182 Dec 19 '20 at 21:50
@schwobaseggl yes, I'm just triggered by `s.__contains__` hah – juanpa.arrivillaga Dec 19 '20 at 21:51
@juanpa.arrivillaga Actually, the first method `s.__contains__` without partial and contains is bit faster. – user_12 Dec 19 '20 at 21:57
@user_12 yes. I assumed as such, but as schwobaseggl stated, this are nitpicks. – juanpa.arrivillaga Dec 19 '20 at 22:02
@schwobaseggl Can you provide some comments in the code on how partial and contains work in this case? I am not very familiar with those functions. – user_12 Dec 19 '20 at 22:06
Shouldn't it be `len(xor)`? – superb rain Dec 19 '20 at 23:10
@superbrain Of course! I was first going with `len(order)`, but that wouldn't work if there were dupes in `xor`, then made it worse :D – user2390182 Dec 20 '20 at 07:46

dreamcrash · Answer 2 · 2020-12-19T21:48:06.793

1

Append all the values from xor that exist also in vals, with the list of all values of vals that do not exist on xor:

sorted_list = [v for v in xor if v in vals] + [v for v in vals if v not in xor]

another approach :

output = list(filter(lambda x: x in vals, xor)) + list(filter(lambda x: x not in xor, vals))

edited Dec 19 '20 at 21:48

answered Dec 19 '20 at 21:26

dreamcrash

47,137
25
94
117

Stuart · Answer 3 · 2020-12-22T18:14:48.130

Given the clarifications added to the question, several of the answers posted here do not actually provide the expected answer, and others are not efficient.

One solution is to take @dreamcrash's answer but using sets for the look-up:

def order_by1(vals, xor):
    set_vals = set(vals)
    set_xor = set(xor)
    return [v for v in xor if v in set_vals] + [v for v in vals if v not in set_xor]

This should increase efficiency by removing the time-consuming (O(n)) list lookup on each iteration of the loop.

Since you don't care about the order of the values that are not found in xor, a variant of this is to replace the 2nd list comprehension with a set operation:

def order_by2(vals, xor):
    set_vals = set(vals)
    return [v for v in xor if v in set_vals] + list(set_vals - set(xor))

Another solution is to use ordered sets

def order_by3(vals, xor):
    vals_set = OrderedSet(vals)
    return (OrderedSet(xor) & vals_set) | vals_set

Ordered sets use dictionaries under the hood. We can instead use dictionaries directly, making use of the fact that they are ordered:

def order_by4(vals, xor):
    """ Order list a by each item's position in list xor, if it is found in xor,
        otherwise by its position in list vals """
    d = {k: False for k in xor}
    for k in vals:
        d[k] = True
    return [k for k, v in d.items() if v]

The only other solution that actually does what is required is the first one in @schwobaseggl's answer. Timing these with lists of 800,000 and 300,000 random phrases of which 150,000 overlap:

import random
import string
import timeit

def random_phrase():
    return "".join(random.choices(string.ascii_letters, k=10))

def generate_lists(M, N, overlap):
    a = [random_phrase() for _ in range(M)] 
    b = ([random_phrase() for _ in range(N - overlap)] +
          random.sample(a, k=overlap))
    random.shuffle(b)
    return a, b

def dreamcrash(vals, xor):
    return [v for v in xor if v in vals] + [v for v in vals if v not in xor]

def schwobaseggl(vals, xor):
    order = {n: i for i, n in enumerate(xor)}
    len_xor = len(xor)   # saves some time when sorting
    return sorted(vals, key=lambda x: order.get(x, len_xor))

vals, xor = generate_lists(800000, 300000, overlap=150000) # this takes a while
for f in [dreamcrash, order_by1, order_by2, order_by3, order_by4, schwobaseggl]:
    print(f, end='...')
    print(timeit.timeit(stmt=f"{f.__name__}(vals, xor)", number=1, globals=globals()))

I gave up on timing dreamcrash because it was taking too long. order_by2 seems fastest, followed by order_by4, order_by1, and schwobaseggl, each taking around .5 - 1.5s on my computer. The ordered set solution is a lot slower. Checking whether a set contains an item, and setting an item in a dictionary, are both O(1) in the average case, O(n) in the worst case which explains why the dictionary and set based versions have similar performance.

Not sure but think it would be similar to ordinary sets - https://wiki.python.org/moin/TimeComplexity - so O(aM+bN) (a and b depending on overlap between xor and vals) on average, O(MN) worst case. In practice the implementation of ordered sets in python may slow things down. — Stuart, Dec 20 '20 at 03:20

danyroza · Answer 4 · 2020-12-19T21:17:11.277

0

This should also be one line solution for you:

vals = [1, 2, 5, 4, 3, 2, 11, 6]
xor = [10, 11]
new_list = xor + [elem for elem in vals if elem not in xor]

(edit: without sorted and correct variables)

edited Dec 19 '20 at 21:17

answered Dec 19 '20 at 21:11

danyroza

154
1
10

Should be fixed now. It is a matter of simple list generation then. – danyroza Dec 19 '20 at 21:17
From what he states, it is not really clear whether he wants them in the final list or not. Technically you could achieve this with another list generator instead of xor list at the start. – danyroza Dec 19 '20 at 21:22

superb rain · Answer 5 · 2020-12-19T23:36:03.047

-1

Another way to do it with sorting:

vals.sort(key=set(xor).__contains__, reverse=True)

I suspect it's faster than the others, but don't feel like trying to create test data that might resemble your actual data. (Partly because your question is still not entirely clear. I'm going with what your reference implementation does, i.e., both groups retaining their order in vals.)

edited Dec 19 '20 at 23:36

answered Dec 19 '20 at 23:00

superb rain

5,300
2
11
25

How to move list items to the front if they are in another list while keeping the order

5 Answers5