2

I have a two example lists,

vals = ["a", "c", "d", "e", "f", "g"]
xor  = ["c", "g"] 

I want to sort vals list based on the xor list, i.e., the values in xor should be placed first in vals list in exact order. The rest of the values present in vals should stay in the same order.

Also, it is possible that values in xor might not be in vals in those cases just ignore those values. And also, in the case of duplicates, I only need one value.

Desired output:

vals = ["c", "g", "a", "d", "e", "f"]
# here a, d, e, f are not in xor so we keep them in same order as found in vals.

My Approach:

new_list = []
for x in vals:
    for y in xor:
        if x == y:
            new_list.append(x)
            
for x in vals:
    if x not in xor:
        new_list.append(x)

The vals list currently has around 800k words or phrases. The xor list has 300k words or phrases but it might increase later on. Some phrases are a bit long as well. What would be the most efficient way to address the problem?

Georgy
  • 12,464
  • 7
  • 65
  • 73
user_12
  • 1,778
  • 7
  • 31
  • 72

5 Answers5

5

Build a order dict of indeces in xor and use it as a sorting key:

order = {n: i for i, n in enumerate(xor)}

sorted(vals, key=lambda x: order.get(x, len(xor)))
# ['c', 'g', 'a', 'd', 'e', 'f']

Using len(vals) as a default value ensures all values not in xor will end up at the back. This of course assumes, that you want the values that are in xor to be sorted based on their order in xor (making the process O(M+NlogN)). Otherwise you can be faster (O(M+N)) with:

from operator import contains
from functools import partial
s = set(xor)
result = list(filter(partial(contains, s), vals))
result.extend(v for v in vals if v not in s)

Or in a more readable fashion:

s = set(xor)
result = [v for v in vals if v in s]
result += (v for v in vals if v not in s)
user2390182
  • 72,016
  • 6
  • 67
  • 89
1

Append all the values from xor that exist also in vals, with the list of all values of vals that do not exist on xor:

sorted_list = [v for v in xor if v in vals] + [v for v in vals if v not in xor]

another approach :

output = list(filter(lambda x: x in vals, xor)) + list(filter(lambda x: x not in xor, vals))
dreamcrash
  • 47,137
  • 25
  • 94
  • 117
1

Given the clarifications added to the question, several of the answers posted here do not actually provide the expected answer, and others are not efficient.

One solution is to take @dreamcrash's answer but using sets for the look-up:

def order_by1(vals, xor):
    set_vals = set(vals)
    set_xor = set(xor)
    return [v for v in xor if v in set_vals] + [v for v in vals if v not in set_xor]

This should increase efficiency by removing the time-consuming (O(n)) list lookup on each iteration of the loop.

Since you don't care about the order of the values that are not found in xor, a variant of this is to replace the 2nd list comprehension with a set operation:

def order_by2(vals, xor):
    set_vals = set(vals)
    return [v for v in xor if v in set_vals] + list(set_vals - set(xor))

Another solution is to use ordered sets

def order_by3(vals, xor):
    vals_set = OrderedSet(vals)
    return (OrderedSet(xor) & vals_set) | vals_set

Ordered sets use dictionaries under the hood. We can instead use dictionaries directly, making use of the fact that they are ordered:

def order_by4(vals, xor):
    """ Order list a by each item's position in list xor, if it is found in xor,
        otherwise by its position in list vals """
    d = {k: False for k in xor}
    for k in vals:
        d[k] = True
    return [k for k, v in d.items() if v]

The only other solution that actually does what is required is the first one in @schwobaseggl's answer. Timing these with lists of 800,000 and 300,000 random phrases of which 150,000 overlap:

import random
import string
import timeit

def random_phrase():
    return "".join(random.choices(string.ascii_letters, k=10))

def generate_lists(M, N, overlap):
    a = [random_phrase() for _ in range(M)] 
    b = ([random_phrase() for _ in range(N - overlap)] +
          random.sample(a, k=overlap))
    random.shuffle(b)
    return a, b

def dreamcrash(vals, xor):
    return [v for v in xor if v in vals] + [v for v in vals if v not in xor]

def schwobaseggl(vals, xor):
    order = {n: i for i, n in enumerate(xor)}
    len_xor = len(xor)   # saves some time when sorting
    return sorted(vals, key=lambda x: order.get(x, len_xor))

vals, xor = generate_lists(800000, 300000, overlap=150000) # this takes a while
for f in [dreamcrash, order_by1, order_by2, order_by3, order_by4, schwobaseggl]:
    print(f, end='...')
    print(timeit.timeit(stmt=f"{f.__name__}(vals, xor)", number=1, globals=globals()))

I gave up on timing dreamcrash because it was taking too long. order_by2 seems fastest, followed by order_by4, order_by1, and schwobaseggl, each taking around .5 - 1.5s on my computer. The ordered set solution is a lot slower. Checking whether a set contains an item, and setting an item in a dictionary, are both O(1) in the average case, O(n) in the worst case which explains why the dictionary and set based versions have similar performance.

Stuart
  • 9,597
  • 1
  • 21
  • 30
  • What is the time complexity of the second approach? – user_12 Dec 19 '20 at 23:01
  • Not sure but think it would be similar to ordinary sets - https://wiki.python.org/moin/TimeComplexity - so O(aM+bN) (a and b depending on overlap between xor and vals) on average, O(MN) worst case. In practice the implementation of ordered sets in python may slow things down. – Stuart Dec 20 '20 at 03:20
  • 1
    See edit for a possibly more efficient algorithm – Stuart Dec 20 '20 at 16:12
  • Edited again with some performance tests. – Stuart Dec 22 '20 at 03:04
0

This should also be one line solution for you:

vals = [1, 2, 5, 4, 3, 2, 11, 6]
xor = [10, 11]
new_list = xor + [elem for elem in vals if elem not in xor]

(edit: without sorted and correct variables)

danyroza
  • 154
  • 1
  • 10
  • Should be fixed now. It is a matter of simple list generation then. – danyroza Dec 19 '20 at 21:17
  • From what he states, it is not really clear whether he wants them in the final list or not. Technically you could achieve this with another list generator instead of xor list at the start. – danyroza Dec 19 '20 at 21:22
-1

Another way to do it with sorting:

vals.sort(key=set(xor).__contains__, reverse=True)

I suspect it's faster than the others, but don't feel like trying to create test data that might resemble your actual data. (Partly because your question is still not entirely clear. I'm going with what your reference implementation does, i.e., both groups retaining their order in vals.)

superb rain
  • 5,300
  • 2
  • 11
  • 25