Python: merge lists or data frames and overwrite missing values

Question

Assuming I have the following lists:

list1 = ['MI', '', 'NY', '', 'AR', '']
list2 = ['', 'MS', '', 'OH', '', '']

Anywhere there is a missing value or an empty string in list1, I want to overwrite the empty string with a corresponding value in list2. Is there an efficient way to do this without having to iterate through each item in list1? Below is my current solution:

list1 = ['MI', '', 'NY', '', 'AR', '']
list2 = ['', 'MS', '', 'OH', '', '']

counter = 0

for each in list1:
    counter = counter + 1
    if len(each) == 0:
        list1[counter-1] = list2[counter-1]
print(list1)
>>> ['MI', 'MS', 'NY', 'OH', 'AR', '']

I tried to convert my lists to pandas data frames and used pandas.DataFrame.update() but didn't get the result I was looking for. A similar problem is solved here but in R.

no, you are going to have to iterate through. You can hide the iteration by using fancy list comps and such, but something will be iterating through those lists behind the scenes. — user2782067, Dec 03 '14 at 04:34

score 3 · Answer 1 · answered Dec 03 '14 at 04:36

There's a more 'Pythonic' way to do it (using list comprehensions), but you still get an iteration in the end:

[x or y for x, y in zip(list1, list2)]
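
One caveat worth sketching: `x or y` falls back to y for any falsy x, not only for ''. If the lists might hold values such as 0 that should be kept, an explicit comparison may be safer. The function name fill_blank is hypothetical, chosen only for illustration.

```python
def fill_blank(primary, fallback):
    """Return a new list, taking fallback[i] only where primary[i] == ''."""
    return [y if x == '' else x for x, y in zip(primary, fallback)]

list1 = ['MI', '', 'NY', '', 'AR', '']
list2 = ['', 'MS', '', 'OH', '', '']
print(fill_blank(list1, list2))                    # ['MI', 'MS', 'NY', 'OH', 'AR', '']

# With `or`, a legitimate 0 is replaced; the explicit test keeps it.
print([x or y for x, y in zip([0, ''], [9, 9])])   # [9, 9]
print(fill_blank([0, ''], [9, 9]))                 # [0, 9]
```

For lists that contain only strings, as in the question, both forms give the same result.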

xbug

Since this iterates over both lists in parallel, perhaps it's less efficient than the initial solution, which iterates over list1 completely but accesses only specific indices of list2. Thoughts? – sedeh Dec 03 '14 at 04:51

Roman Pekar · Accepted Answer · 2014-12-03T14:51:42.350

You can use the pandas.Series.where() method, but I suppose there will still be an iteration:

>>> import pandas as pd
>>> s1 = pd.Series(list1)
>>> s2 = pd.Series(list2)
>>> s1.where(s1 != '', s2)
0    MI
1    MS
2    NY
3    OH
4    AR
5
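
Since the question also mentions data frames, the same idea appears to carry over to pandas.DataFrame.where(). This is a minimal sketch, assuming two frames with the same shape, index and column labels. The column name 'state' is made up for illustration.

```python
import pandas as pd

list1 = ['MI', '', 'NY', '', 'AR', '']
list2 = ['', 'MS', '', 'OH', '', '']

# 'state' is a hypothetical column name, not taken from the question.
df1 = pd.DataFrame({'state': list1})
df2 = pd.DataFrame({'state': list2})

# Keep df1 where the condition holds; take the aligned df2 value elsewhere.
merged = df1.where(df1 != '', df2)
print(merged['state'].tolist())   # ['MI', 'MS', 'NY', 'OH', 'AR', '']
```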

Concerning your original method, you don't have to keep your own counter, by the way; you can use the built-in enumerate() function:

>>> def isnull1(list1, list2):
...     res = []
...     for i, x in enumerate(list1):
...         if not x:
...             res.append(list2[i])
...         else:
...             res.append(x)
...     return res
... 
>>> isnull1(list1, list2)
['MI', 'MS', 'NY', 'OH', 'AR', '']
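
Note that the loop in the question changes list1 in place, while isnull1() builds a new list. If the in-place behaviour matters, a sketch along these lines keeps it (fill_in_place is a hypothetical name):

```python
def fill_in_place(target, fallback):
    """Overwrite empty entries of target with the value at the same index in fallback."""
    for i, x in enumerate(target):
        if not x:
            target[i] = fallback[i]

list1 = ['MI', '', 'NY', '', 'AR', '']
list2 = ['', 'MS', '', 'OH', '', '']
fill_in_place(list1, list2)
print(list1)   # ['MI', 'MS', 'NY', 'OH', 'AR', '']
```

Assigning to existing indices while enumerating is safe here because the length of the list never changes.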

But an even better solution would be to use zip() and map() (on Python 3, wrap the call in list(), because map() returns a lazy iterator there):

>>> list(map(lambda x: x[1] if not x[0] else x[0], zip(list1, list2)))
['MI', 'MS', 'NY', 'OH', 'AR', '']

It's also better to use generators if you don't need the list right away:

>>> def isnull2(list1, list2):
...     for i, x in enumerate(list1):
...         if not x:
...             yield list2[i]
...         else:
...             yield x
... 
>>> list(isnull2(list1, list2))
['MI', 'MS', 'NY', 'OH', 'AR', '']

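or build the result lazily with map() and zip(). On Python 2 that meant imap() and izip() from itertools; they were removed in Python 3, where the built-in map() and zip() are already lazy:

>>> lazy = map(lambda x: x[1] if not x[0] else x[0], zip(list1, list2))
>>> list(lazy)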
['MI', 'MS', 'NY', 'OH', 'AR', '']

Nice! Does this iterate over s1 and s2 completely? If so, it's probably O(n^2), whereas the solution in the question appears to be O(n). Further thoughts? – sedeh, Dec 03 '14 at 13:22
If it iterated over s2 for each element of s1, then it would be O(n^2). Both solutions are O(n); you can test both to see which is faster. The pandas one could be faster because of faster access to the elements of the Series (which is a NumPy array underneath). – Roman Pekar, Dec 03 '14 at 13:29
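
To test which is faster, something like the following timeit sketch could be used. The list size and repeat count are arbitrary assumptions, and the function names are made up for illustration. Timings vary by machine and Python version, so none are shown.

```python
import timeit

def with_loop(a, b):
    """Index-based fill, close to the loop in the question (but on a copy)."""
    res = list(a)
    for i, x in enumerate(res):
        if not x:
            res[i] = b[i]
    return res

def with_zip(a, b):
    """List comprehension over both lists in parallel."""
    return [x or y for x, y in zip(a, b)]

# Arbitrary test data: the question's lists repeated many times.
big1 = ['MI', '', 'NY', '', 'AR', ''] * 1000
big2 = ['', 'MS', '', 'OH', '', ''] * 1000

# Both approaches must agree before their timings are worth comparing.
assert with_loop(big1, big2) == with_zip(big1, big2)

for func in (with_loop, with_zip):
    seconds = timeit.timeit(lambda: func(big1, big2), number=100)
    print(func.__name__, round(seconds, 4))
```

Both functions make a single pass over the data, so any difference should be a constant factor rather than a change in complexity.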

score 0 · Answer 3 · answered Dec 03 '14 at 04:47

Maybe this can help:

def list_default(l1, l2):
    i1 = iter(l1)
    i2 = iter(l2)
    for i in i1:
        next_default = next(i2)
        if not i:
            yield next_default
        else:
            yield i

list1 = ['MI', '', 'NY', '', 'AR', '']
list2 = ['', 'MS', '', 'OH', '', '']

print(list(list_default(list1, list2)))
>>> ['MI', 'MS', 'NY', 'OH', 'AR', '']

You have to iterate over the sequence to find the missing values, but with the function above you avoid tracking list indices.
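
If the two lists can differ in length, zip() stops at the shorter one and list_default() relies on l2 being at least as long as l1. One possible way to handle that, assuming missing positions should count as empty strings, is itertools.zip_longest() on Python 3. The function name merge_padded is hypothetical.

```python
from itertools import zip_longest

def merge_padded(l1, l2):
    """Yield l1's item, or l2's when l1's is empty; pad the shorter list with ''."""
    for x, y in zip_longest(l1, l2, fillvalue=''):
        yield x or y

print(list(merge_padded(['MI', ''], ['', 'MS', 'TX'])))   # ['MI', 'MS', 'TX']
print(list(merge_padded(['MI', '', 'NY'], ['', 'MS'])))   # ['MI', 'MS', 'NY']
```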

Sorry for my English
