8

I have two lists of values of the same sortable type, which are sorted in ascending order, but (i) they don't have the same length, and (ii) entries present in one list can be missing from the other and vice versa. However I know that the majority of values in one list are present in the other, and that there are no duplicates in any list.

So we may have this situation:

list1 = [value1-0, value1-1, value1-2, value1-3]
list2 = [value2-0, value2-1, value2-2]

If it happens that the order of the values from both lists is:

value1-0 < (value1-1 = value2-0) < value2-1 < value1-2 < value1-3 < value2-2

we can give combined sorted value names to the values across the two lists, e.g.:

valueA < valueB < valueC < valueD < valueE < valueF

so that the two lists can be written as:

list1 = [valueA, valueB, valueD, valueE]
list2 = [valueB, valueC, valueF]

Given this I want the lists to become:

new_list1 = [valueA,    valueB, "MISSING", valueD,    valueE,    "MISSING"]
new_list2 = ["MISSING", valueB, valueC,    "MISSING", "MISSING", valueF   ]

Can anyone help?

EDIT: The original question referred to datetime objects in particular (hence the comments specific to datetimes), but has been generalized to any sortable type.

Orestis Tsinalis
  • 351
  • 3
  • 14
  • How can you determine the order of the final lists when you don't know the relative order of the datetime objects? i.e. what's forcing the last two elements to be `E` then `F`? Why not `F` then `E`? – arshajii Aug 18 '13 at 21:31
  • Sorry, this might have been a bit confusing. I saw the answer you deleted and it was what I wanted, but voted it after you deleted it. What I meant is that, for example, if you look at `list1` _within itself_, then the absolute order is: `{A -> A, B -> B, D -> C, E -> D}`. The combined (or relative) numbering between the two lists that I put in the question is essentially achieved _after_ the solution is applied. It is `(E,F)` and not `(F,E)` because `datetimeE < datetimeF`. Does this make sense? I'l try to edit the question to make it clear. – Orestis Tsinalis Aug 18 '13 at 21:35
  • For completeness, see [pandas](http://pandas.pydata.org/pandas-docs/dev/merging.html). It was written exactly for this kind of data manipulation. – Caleb Hattingh Aug 18 '13 at 22:44

3 Answers3

8

How about something like this:

set1 = set(list1)
set2 = set(list2)
total = sorted(set1|set2)

new_list1 = [x if x in set1 else "MISSING" for x in total]
new_list2 = [x if x in set2 else "MISSING" for x in total]
arshajii
  • 127,459
  • 24
  • 238
  • 287
  • Since it is obvious that the elements don't need to be `datetime` objects, but of any sortable type, shall I generalize the question? – Orestis Tsinalis Aug 18 '13 at 21:49
  • @OrestisTsinalis That's 100% up to you. It might be more helpful for others who stumble upon your question in the future if you generalize it, but I'm sure most could make these generalizations themselves. – arshajii Aug 18 '13 at 21:53
  • This won't work if there are duplicate elements the source lists. – Petr Viktorin Aug 18 '13 at 22:03
  • @PetrViktorin *"there are no duplicates in any list"* – arshajii Aug 18 '13 at 22:06
  • Generalizing to any number of lists (using [this stackoverflow answer](http://stackoverflow.com/a/2151553/786235)): `>>> lists = [list1, list2, ..., listN] >>> total = sorted(set().union(*lists)) >>> new_lists = [[x if x in my_list else "MISSING" for x in total] for my_list in lists]` – Orestis Tsinalis Aug 18 '13 at 22:41
7

This problem piqued my interest, so I wrote an overly generic solution.

Here's a function that

  • aligns any number of sequences
  • works on iterators, so it can efficiently handle long (or infinite) sequences
  • supports repeated values
  • is compatible with Python 2 and 3 (although I'd use align_iterables(*inputs, missing_value=None) if I didn't care about historical Python versions)

import itertools

def align_iterables(inputs, missing=None):
    """Align sorted iterables

    Yields tuples with values from the respective `inputs`, placing
    `missing` if the value does not exist in the corresponding
    iterable.

    Example: align_generator('bc', 'bf', '', 'abf') yields:
        (None, None, None, 'a')
        ('b', 'b', None, 'b')
        ('c', None, None, None)
        (None, 'f', None, 'f')
    """
    End = object()
    iterators = [itertools.chain(i, [End]) for i in inputs]
    values = [next(i) for i in iterators]
    while not all(v is End for v in values):
        smallest = min(v for v in values if v is not End)
        yield tuple(v if v == smallest else missing for v in values)
        values = [next(i) if v == smallest else v
                  for i, v in zip(iterators, values)]

# An adapter for this question's problem:

def align_two_lists(list1, list2, missing="MISSING"):
    value = list(zip(*list(align_iterables([list1, list2], missing=missing))))
    if not value:
        return [[], []]
    else:
        a, b = value
        return [list(a), list(b)]

# A set of tests for the question's problem:

if __name__ == '__main__':
    assert align_two_lists('abcef', 'abcdef', '_') == [['a', 'b', 'c', '_', 'e', 'f'], ['a', 'b', 'c', 'd', 'e', 'f']]
    assert align_two_lists('a', 'abcdef', '_') == [['a', '_', '_', '_', '_', '_'], ['a', 'b', 'c', 'd', 'e', 'f']]
    assert align_two_lists('abcdef', 'a', '_') == [['a', 'b', 'c', 'd', 'e', 'f'], ['a', '_', '_', '_', '_', '_']]
    assert align_two_lists('', 'abcdef', '_') == [['_', '_', '_', '_', '_', '_'], ['a', 'b', 'c', 'd', 'e', 'f']]
    assert align_two_lists('abcdef', '', '_') == [['a', 'b', 'c', 'd', 'e', 'f'], ['_', '_', '_', '_', '_', '_']]
    assert align_two_lists('ace', 'abcdef', '_') == [['a', '_', 'c', '_', 'e', '_'], ['a', 'b', 'c', 'd', 'e', 'f']]
    assert align_two_lists('bdf', 'ace', '_') == [['_', 'b', '_', 'd', '_', 'f'], ['a', '_', 'c', '_', 'e', '_']]
    assert align_two_lists('ace', 'bdf', '_') == [['a', '_', 'c', '_', 'e', '_'], ['_', 'b', '_', 'd', '_', 'f']]
    assert align_two_lists('aaacd', 'acd', '_') == [['a', 'a', 'a', 'c', 'd'], ['a', '_', '_', 'c', 'd']]
    assert align_two_lists('acd', 'aaacd', '_') == [['a', '_', '_', 'c', 'd'], ['a', 'a', 'a', 'c', 'd']]
    assert align_two_lists('', '', '_') == [[], []]

    list1 = ["datetimeA", "datetimeB", "datetimeD", "datetimeE"]
    list2 = ["datetimeB", "datetimeC", "datetimeD", "datetimeF"]

    new_list1 = ["datetimeA", "datetimeB", "MISSING", "datetimeD", "datetimeE", "MISSING"]
    new_list2 = ["MISSING", "datetimeB", "datetimeC", "datetimeD", "MISSING", "datetimeF"]

    assert align_two_lists(list1, list2) == [new_list1, new_list2]

# And some extra tests:

    # Also test multiple generators
    for expected, got in zip(
            [(None, None, None, 'a'),
             ('b', 'b', None, 'b'),
             ('c', None, None, None),
             (None, 'f', None, 'f')],
            align_iterables(['bc', 'bf', '', 'abf'])):
        assert expected == got

    assert list(align_iterables([])) == []

    # And an infinite generator
    for expected, got in zip(
            [(0, 0),
             ('X', 1),
             (2, 2),
             ('X', 3),
             (4, 4)],
            align_iterables([itertools.count(step=2), itertools.count()], missing='X')):
        assert expected == got
Petr Viktorin
  • 65,510
  • 9
  • 81
  • 81
1

You could try:

new_list1=[]
new_list2=[]

i=j=0
while True:
    print '1 ' + str(new_list1) +' '+str(i)
    print '2 ' + str(new_list2) +' '+str(j)
    if list1[i]==list2[j]:
        new_list1 += [list1[i]]
        new_list2 += [list2[j]]
        i=i+1
        j=j+1
    elif list1[i]>list2[j]:
        new_list1 += ["MISSING"]
        new_list2 += [list2[j]]
        j=j+1
    else: # list1[i]<list2[j]
        new_list1 += [list1[i]]
        new_list2 += ["MISSING"]
        i=i+1
    if i>=len(list1) or j>=len(list2):
        break
while i<len(list1):
    new_list1 += [list1[i]]
    new_list2 += ["MISSING"]
    i=i+1
while j<len(list2):
    new_list1 += ["MISSING"]
    new_list2 += [list2[j]]
    j=j+1

It looks like a lot of code but it should work and loops through the lists just once.

Lorenzo Baracchi
  • 1,808
  • 1
  • 13
  • 18
  • Just move the `if i>=len(list1) or j>=len(list2):` block to the beginning of the loop, so empty lists are handled properly. – Petr Viktorin Aug 18 '13 at 23:13