27

I have two lists, let's say:

keys1 = ['A', 'B', 'C', 'D', 'E',           'H', 'I']
keys2 = ['A', 'B',           'E', 'F', 'G', 'H',      'J', 'K']

How do I create a merged list without duplicates that preserve the order of both lists, inserting the missing elements where they belong? Like so:

merged = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K']

Note that the elements can be compared against equality but not ordered (they are complex strings). The elements can't be ordered by comparing them, but they have an order based on their occurrence in the original lists.

In case of contradiction (different order in both input lists), any output containing all elements is valid. Of course with bonus points if the solution shows 'common sense' in preserving most of the order.

Again (as some comments still argue about it), the lists normally don't contradict each other in terms of the order of the common elements. In case they do, the algorithm needs to handle that error gracefully.

I started with a version that iterates over the lists with .next() to advance just the list containing the unmatched elements, but .next() just doesn't know when to stop.

merged = []
L = iter(keys1)
H = iter(keys2)
l = L.next()
h = H.next()

for i in range(max(len(keys1, keys2))):
  if l == h:
    if l not in merged:
      merged.append(l)
    l = L.next()
    h = H.next()

  elif l not in keys2:
    if l not in merged:
      merged.append(l)
    l = L.next()

  elif h not in keys1:
    if h not in merged:
      merged.append(h)
    h = H.next()

  else: # just in case the input is badly ordered
    if l not in merged:
      merged.append(l)
    l = L.next()
    if h not in merged:
      merged.append(h)
    h = H.next()   

print merged

This obviously doesn't work, as .next() will cause an exception for the shortest list. Now I could update my code to catch that exception every time I call .next(). But the code already is quite un-pythonic and this would clearly burst the bubble.

Does anyone have a better idea of how to iterate over those lists to combine the elements?

Bonus points if I can do it for three lists in one go.

double-beep
  • 5,031
  • 17
  • 33
  • 41
Chaos_99
  • 2,284
  • 2
  • 25
  • 29
  • 3
    I don't think that the list you want to compute is guaranteed to exist in general. What if `keys1 = ['A', 'B', 'D']; keys2 = ['D', 'C', 'B']`? – Ryan C. Thompson Jan 09 '13 at 16:21
  • 1
    how should an algorithm solve this case: `keys1 = ['A', '%', '*']` and `keys1 = ['A', '@', '?']` – tzelleke Jan 09 '13 at 16:22
  • @RyanThompson There are solutions, namely ['A', 'B', 'D', 'C', 'B'] and ['A', 'D', 'C', 'B', 'D'], but how to choose which one to return? And is an element allowed to be repeated in the output sequence? – Khaur Jan 09 '13 at 16:25
  • I guess that's the point. The question gives an example where the desired answer is made obvious by spacing and the use of alphabetic characters in order, but then says that the elements are unordered. So the example given doesn't fully specify the what the desired result is in the general case. – Ryan C. Thompson Jan 09 '13 at 18:00
  • 1
    Thinking some more, I wonder if the OP isn't effectively asking for a solution to the shortest common superstring problem? – Ryan C. Thompson Jan 09 '13 at 18:02
  • I've updated the question. @TheodrosZelleke: The output in your case should be either `['A', '%', '*', '@', '?']` or `['A', '@', '?', '%', '*']`. I don't care which. – Chaos_99 Jan 10 '13 at 09:50
  • @Ryan Thompson: The elements ARE ordered in the input lists. They just can't be ordered by any computational criteria. – Chaos_99 Jan 10 '13 at 10:15
  • @RyanThompson you're so right. And there's even a worst where `keys1 = ['A', 'B', 'D'];` and `keys2 = ['B', 'D', 'A'].` It's absurd in the mathematical proof sense to say that mergeList can preserve the order of keys1 and keys2 since both are contrary. – Stephane Rolland Jan 10 '13 at 16:33

7 Answers7

20

What you need is basically what any merge utility does: It tries to merge two sequences, while keeping the relative order of each sequence. You can use Python's difflib module to diff the two sequences, and merge them:

from difflib import SequenceMatcher

def merge_sequences(seq1,seq2):
    sm=SequenceMatcher(a=seq1,b=seq2)
    res = []
    for (op, start1, end1, start2, end2) in sm.get_opcodes():
        if op == 'equal' or op=='delete':
            #This range appears in both sequences, or only in the first one.
            res += seq1[start1:end1]
        elif op == 'insert':
            #This range appears in only the second sequence.
            res += seq2[start2:end2]
        elif op == 'replace':
            #There are different ranges in each sequence - add both.
            res += seq1[start1:end1]
            res += seq2[start2:end2]
    return res

Example:

>>> keys1 = ['A', 'B', 'C', 'D', 'E',           'H', 'I']
>>> keys2 = ['A', 'B',           'E', 'F', 'G', 'H',      'J', 'K']
>>> merge_sequences(keys1, keys2)
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K']

Note that the answer you expect is not necessarily the only possible one. For example, if we change the order of sequences here, we get another answer which is just as valid:

>>> merge_sequences(keys2, keys1)
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'K', 'I']
interjay
  • 107,303
  • 21
  • 270
  • 254
  • 1
    +1 for the J in non-alphabetical order ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'K', 'I'], that's what I was trying to say in my comment. – Stephane Rolland Jan 09 '13 at 16:56
  • Looks very good to me. Thanks! @Stephane Rolland: You only notice the out-of-order J because those letters have a natural order. With random strings 'J', 'K' and 'I' have exactly the same property: they follow a 'K'. So the sollution is pretty valid. – Chaos_99 Jan 10 '13 at 10:00
  • Just checked. Works like a charm. And is reasonably fast in my case (about 1000 elements per list in 3 lists). You are the best! – Chaos_99 Jan 10 '13 at 10:51
  • 1
    And by (sheer :-)) curiosity how does it does work with let's says: ['K','A','O','S'] and ['S','O','A','K','I','N','G'] :-) – Stephane Rolland Jan 10 '13 at 16:22
  • 2
    @StephaneRolland: In that case you'll get some duplication, for example KAOSOAKING. – interjay Jan 10 '13 at 21:08
  • and in what other circumstances have you used this modules ? I feel it can be useful, but nothing came to my mind. – Stephane Rolland Jan 10 '13 at 21:21
  • 1
    @StephaneRolland It can be useful for making file comparison tools, source control software, incremental patching, etc. I don't think I've used it myself. – interjay Jan 11 '13 at 14:16
  • 1
    this fails when trying to merge 2 lists with reversed items. `merge_sequences([1,2,3], [3,2,1])` will end up in `[1,2,3,2,1]`. any suggestions? – aschmid00 May 21 '14 at 15:36
  • @aschmid00 If you don't want duplicates (I don't), remove duplicates from the list at the end while keeping the order: http://stackoverflow.com/questions/480214/how-do-you-remove-duplicates-from-a-list-in-python-whilst-preserving-order – Sam Watkins Oct 17 '16 at 00:41
3

I suspect that you may be asking for a solution to the shortest common supersequence problem, which I believe is NP-hard in the general case of an arbitrary number of input sequences. I'm not aware of any libraries for solving this problem, so you might have to implement one by hand. Probably the quickest way to get to working code would be to take interjay's answer using difflib and then use reduce to run it on an arbitrary number of lists (make sure to specify the empty list as the 3rd argument to reduce).

Ryan C. Thompson
  • 40,856
  • 28
  • 97
  • 159
  • Yes, the wiki definition seems about right for my problem. Thanks for pointing out the correct term and the note on using reduce for more then two input sequences. – Chaos_99 Jan 10 '13 at 10:08
2

I would use a Set (cf. python doc), that I'd fill with the elements of the two lists, one aafter the other.

And make a list from the Set when it's done.

Note that there is a contradiction/paradox in your question: you want to preserve order for elements that cannot be compared (only equality because "they are complex strings" as you said).

EDIT: the OP is right noticing that sets don't preserve order of insertion.

Stephane Rolland
  • 38,876
  • 35
  • 121
  • 169
  • From the example I guess the elements between matches should be inserted starting with the ones from the first list. – Khaur Jan 09 '13 at 16:19
  • A set is unordered. He wants to preserve the order of the two lists here. And there is no contradiction: It's possible to preserve ordering in a list without comparing the list's elements. – interjay Jan 09 '13 at 16:37
  • 1
    The ordering you refer about is the order of the list, not the order of the item's value ( since they are not orderly comparable). The question is biaised because it represents the two list with the implicit alphabetical order which is not applicable according to the OP (because these are complexe strings...) A + iB ? :-) – Stephane Rolland Jan 09 '13 at 16:46
  • +1 for getting the difference between the order of elements and orderable elements. (-1 for the set) – Chaos_99 Jan 10 '13 at 10:02
  • thx for pointing me out that sets would not preserve the order of insertion. – Stephane Rolland Jan 10 '13 at 16:41
  • Hi Stephane. Your updated answer now onl calculates the common elements of both lists. Not what I asked for. – Chaos_99 Jan 21 '13 at 07:46
  • you're right, my added code is just false. I remove that part. – Stephane Rolland Jan 21 '13 at 09:28
1

By using only lists, you can achieve this with few simple for loops and .copy():

def mergeLists(list1, list2):
    # Exit if list2 is empty
    if not len(list2):
        return list1
    # Copy the content of list2 into merged list
    merged = list2.copy()

    # Create a list for storing temporary elements
    elements = []
    # Create a variable for storing previous element found in both lists
    previous = None

    # Loop through the elements of list1
    for e in list1:
        # Append the element to "elements" list if it's not in list2
        if e not in merged:
            elements.append(e)

        # If it is in list2 (is a common element)
        else:

            # Loop through the stored elements
            for x in elements:
                # Insert all the stored elements after the previous common element
                merged.insert(previous and merged.index(previous) + 1 or 0, x)
            # Save new common element to previous
            previous = e
            # Empty temporary elements
            del elements[:]

    # If no more common elements were found but there are elements still stored
    if len(elements)
        # Insert them after the previous match
        for e in elements:
            merged.insert(previous and merged.index(previous) + 1 or 0, e)
    # Return the merged list
    return merged

In [1]: keys1 = ["A", "B",      "D",      "F", "G", "H"]
In [2]: keys2 = ["A",      "C", "D", "E", "F",      "H"]
In [3]: mergeLists(keys1, keys2)
Out[3]: ["A", "B", "C", "D", "E", "F", "G", "H"]

English is not my first language, and this one is pretty hard to explain, but if you care about the explanation, here's what it does:

  • There's a local list called elements which can store temporary elements.
  • There's a local variable called previous which stores the previous element that was in both lists.
  • When ever it finds an element that is NOT in list2 but is in list1, it will append that element to elements list and continue the loop.
  • Once it hits an element that is in both lists, it loops through the elements list, appending all elements after previous element to list2.
  • The new match is then stored into previous and elements is reset to [] and the loop continues.
  • Beginning of the lists and end of the lists are counted as a common element, if first or last element is not a common element in both lists.

This way it will always follow this format:

  1. Previous common element
  2. Elements from list1, between two common elements
  3. Elements in list2, between two common elements
  4. New common element

So for example:

l1 = ["A", "B", "C",      "E"]
l2 = ["A",           "D", "E"]
  1. The revious common element A will be first in the merged list.
  2. Elements from l1 between the previous common element A and the new common element E will be inserted right after A.
  3. Elements from l2 between the previous common elmeent A and the new common elmeent E will be inserted right after the elements from l1.
  4. The new common element E will be last element.
  5. Back to step 1 if more common elements found.

    ["A", "B", "C", "D", "E"]

  • Neither of these solutions preserve the order of the lists. The second solution is close but can change the order of elements in the second list. – interjay Jan 10 '13 at 09:41
  • As interjay said: with sets, the order is not preserved. The second solutions does not interleave, but only append. – Chaos_99 Jan 10 '13 at 10:09
  • Fixed it, and removed the `set` option. –  Jan 10 '13 at 17:00
1

I recently had stumbled upon a similar issue while implementing a feature. I tried to clearly define the problem statement first. If I understand right, here is the problem statement

Problem Statement

Write a function merge_lists which will merge a list of lists with overlapping items, while preserving the order of items.

Constraints

  1. If item A comes before item B in all the lists where they occur together, then item A must precede item B in the final list also

  2. If item A and item B interchange order in different lists, ie in some lists A precedes B and in some others B precedes A, then the order of A and B in the final list should be the same as their order in the first list where they occur together. That is, if A precedes B in l1 and B precedes A in l2, then A should precede B in final list

  3. If Item A and Item B do not occur together in any list, then their order must be decided by the position of the list in which each one occurs first. That is, if item A is in l1 and l3, item B is in l2 and l6, then the order in the final list must be A then B

Test case 1:

Input:

l1 = ["Type and Size", "Orientation", "Material", "Locations", "Front Print Type", "Back Print Type"]

l2 = ["Type and Size", "Material", "Locations", "Front Print Type", "Front Print Size", "Back Print Type", "Back Print Size"]

l3 = ["Orientation", "Material", "Locations", "Color", "Front Print Type"]

merge_lists([l1,l2,l3])

Output:

['Type and Size', 'Orientation', 'Material', 'Locations', 'Color', 'Front Print Type', 'Front Print Size', 'Back Print Type', 'Back Print Size']

Test case 2:

Input:

l1 = ["T", "V", "U", "B", "C", "I", "N"]

l2 = ["Y", "V", "U", "G", "B", "I"]

l3 = ["X", "T", "V", "M", "B", "C", "I"]

l4 = ["U", "P", "G"]

merge_lists([l1,l2,l3, l4])

Output:

['Y', 'X', 'T', 'V', 'U', 'M', 'P', 'G', 'B', 'C', 'I', 'N']

Test case 3:

Input:

l1 = ["T", "V", "U", "B", "C", "I", "N"]

l2 = ["Y", "U", "V", "G", "B", "I"]

l3 = ["X", "T", "V", "M", "I", "C", "B"]

l4 = ["U", "P", "G"]

merge_lists([l1,l2,l3, l4])

Output:

['Y', 'X', 'T', 'V', 'U', 'M', 'P', 'G', 'B', 'C', 'I', 'N']

Solution

I arrived at a reasonable solution which solved it correctly for all the data I had. (It might be wrong for some other data set. Will leave it for others to comment that). Here is the solution

def remove_duplicates(l):
    return list(set(l))

def flatten(list_of_lists):
    return [item for sublist in list_of_lists for item in sublist]

def difference(list1, list2):
    result = []
    for item in list1:
        if item not in list2:
            result.append(item)
    return result

def preceding_items_list(l, item):
    if item not in l:
        return []
    return l[:l.index(item)]

def merge_lists(list_of_lists):
    final_list = []
    item_predecessors = {}

    unique_items = remove_duplicates(flatten(list_of_lists))
    item_priorities = {}

    for item in unique_items:
        preceding_items = remove_duplicates(flatten([preceding_items_list(l, item) for l in list_of_lists]))
        for p_item in preceding_items:
            if p_item in item_predecessors and item in item_predecessors[p_item]:
                preceding_items.remove(p_item)
        item_predecessors[item] = preceding_items
    print "Item predecessors ", item_predecessors

    items_to_be_checked = difference(unique_items, item_priorities.keys())
    loop_ctr = -1
    while len(items_to_be_checked) > 0:
        loop_ctr += 1
        print "Starting loop {0}".format(loop_ctr)
        print "items to be checked ", items_to_be_checked
        for item in items_to_be_checked:
            predecessors = item_predecessors[item]
            if len(predecessors) == 0:
                item_priorities[item] = 0
            else:
                if all(pred in item_priorities for pred in predecessors):
                    item_priorities[item] = max([item_priorities[p] for p in predecessors]) + 1
        print "item_priorities at end of loop ", item_priorities
        items_to_be_checked = difference(unique_items, item_priorities.keys())
        print "items to be checked at end of loop ", items_to_be_checked
        print

    final_list = sorted(unique_items, key=lambda item: item_priorities[item])
    return final_list

I've also open sourced the code as a part of the library named toolspy. So you can just do this

pip install toolspy

from toolspy import merge_lists
lls=[['a', 'x', 'g'], ['x', 'v', 'g'], ['b', 'a', 'c', 'x']]
merge_lists(lls)
suryasankar
  • 4,821
  • 2
  • 14
  • 13
0

Here's a C# solution I came up with -- using an extension method -- for the case where the two lists might not contain the same type of elements, so it takes a compare method and a selector method (that returns an object of the target type given the source object). In this case, the first list ("me") is modified to contain the final result, but it could be modified to create a separate list.

public static class ListExtensions
{
    /// <summary>
    /// Merges two sorted lists containing potentially different types of objects, resulting in a single
    /// sorted list of objects of type T with no duplicates.
    /// </summary>
    public static void MergeOrderedList<TMe, TOther>(this List<TMe> me, IReadOnlyList<TOther> other, Func<TMe, TOther, int> compare = null, Func<TOther, TMe> selectT = null)
    {
        if (other == null)
            throw new ArgumentNullException(nameof(other));
        if (compare == null)
        {
            if (typeof(TMe).GetInterfaces().Any(i => i == typeof(IComparable<TOther>)))
            {
                compare = (a, b) => ((IComparable<TOther>)a).CompareTo(b);
            }
            else
            {
                throw new ArgumentNullException(nameof(compare),
                    "A comparison method must be supplied if no default comparison exists.");
            }
        }

        if (selectT == null)
            if (typeof(TMe).IsAssignableFrom(typeof(TOther)))
            {
                selectT = o => (TMe)(o as object);
            }
            else
            {
                throw new ArgumentNullException(nameof(selectT),
                    $"A selection method must be supplied if the items in the other list cannot be assigned to the type of the items in \"{nameof(me)}\"");
            }

        if (me.Count == 0)
        {
            me.AddRange(other.Select(selectT));
            return;
        }

        for (int o = 0, m = 0; o < other.Count; o++)
        {
            var currentOther = other[o];
            while (compare(me[m], currentOther) < 0 && ++m < me.Count) {}

            if (m == me.Count)
            {
                me.AddRange(other.Skip(o).Select(selectT));
                break;
            }

            if (compare(me[m], currentOther) != 0)
                me.Insert(m, selectT(currentOther));
        }
    }
}

Note: I did write unit tests for this, so it's solid.

Tom Bogle
  • 464
  • 4
  • 18
0

Assuming the problem is to keep the same relative order of as many common items as possible.

Consider the nodes of a graph represent the index pairs m, n having same value in the corresponding lists. E.g. [a, b, c] and [b, a, c] => [(0, 1), (1, 0), (2, 2)]

The relative order of two nodes m, n and m', n' can be satisfied if and only if (m < m' and n < n') or (m > m' and n > n'). In previous example (0, 1), (1, 0) do not satisfy this condition and therefore it is impossible to satisfy the relative order of a and b in both lists. Whereas (1, 0), (2, 2) satisfy this property and therefore it is possible to preserve the order of a and c.

Based on this condition, find edges between all node pairs O(n^2). To find the most optimal arrangement, find the largest maximal clique (which is NP-complete O(3^(n/3))) using Bron–Kerbosch algorithm. The resultant same value index pairs can be used as the anchors to generate the merged list.

If approximate ordering is acceptable for a polynomial solution, below method uses union-find (with path compression and rank optimization) to approximate the maximal clique and runs in O(n^2) time and takes O(n) space.

from collections import defaultdict

def find(cur, d):
    path = []
    while d[cur] != cur:
        path.append(cur)
        cur = d[cur]
    for n in path:
        d[n] = cur
    return cur

def f(o, n):
    if o == n:
        return o

    first_list = list(reversed(o))
    second_list = list(reversed(n))
    first_list_dict = {v: i for i, v in enumerate(first_list)}
    common = []
    for i, v in enumerate(second_list):
        if v in first_list_dict:
            common.append((first_list_dict[v], i))
    if not common:
        o.extend(n)
        return o
    if len(common) == len(first_list):
        return list({v: None for l in (o, n) for v in l})
    if len(common) == len(second_list):
        return o


    d = {p:p for p in common}
    rank = defaultdict(int)
    for i, p1 in enumerate(common):
        for p2 in common[i+1:]:
            if (p1[0] - p2[0]) * (p1[1] - p2[1]) > 0:
                h1, h2 = find(p1, d), find(p2, d)
                if rank[h1] > rank[h2]:
                    h1, h2 = h2, h1
                elif rank[h1] == rank[h2]:
                    rank[h2] += 1
                d[h1] = h2

    freq = defaultdict(list)
    for p in common:
        freq[find(p, d)].append(p)
    largest_clique = sorted(max(freq.values(), key=len))

    res = []
    seen = set()
    l_idx1, l_idx2 = 0, 0
    while largest_clique:
        idx1, idx2 = largest_clique.pop()
        new = first_list[l_idx1-1:idx1:-1]
        new.extend(second_list[l_idx2-1:idx2:-1])
        new.append(first_list[idx1])
        res.extend(v for v in new if v not in seen)
        seen.update(new)
        l_idx1, l_idx2 = idx1, idx2

    return res


for o, n in (
        [[0, 1, 2, 3, 4, 5], [5, 0, 1, 3, 4]],
        [[0, 1, 2, 3, 4, 5], [7, 3, 1, 2, 4, 5]],
        [[0, 1, 2, 3, 4, 5], [3, 4, 5, 0, 1, 2]],
        [["A", "B", "C", "E"], ["A", "D", "E"]],
        [["A", "B", "D", "F", "G", "H"], ["A", "C", "D", "E", "F", "H"]],
        [[0, 1, 2, 3, 4], [5, 6, 7, 8]],
        ):
    print(f"{str(o): <30}, {str(n): <30} => {str(f(o, n)): >40}")

Gives:

[0, 1, 2, 3, 4, 5]            , [5, 0, 1, 3, 4]                =>                       [0, 1, 2, 3, 4, 5]
[0, 1, 2, 3, 4, 5]            , [7, 3, 1, 2, 4, 5]             =>                    [0, 7, 3, 1, 2, 4, 5]
[0, 1, 2, 3, 4, 5]            , [3, 4, 5, 0, 1, 2]             =>                       [0, 1, 2, 3, 4, 5]
['A', 'B', 'C', 'E']          , ['A', 'D', 'E']                =>                ['A', 'B', 'C', 'D', 'E']
['A', 'B', 'D', 'F', 'G', 'H'], ['A', 'C', 'D', 'E', 'F', 'H'] => ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
[0, 1, 2, 3, 4]               , [5, 6, 7, 8]                   =>              [0, 1, 2, 3, 4, 5, 6, 7, 8]
tejasvi88
  • 635
  • 7
  • 15