51

I have two lists as follows.

mylist1 = [["lemon", 0.1], ["egg", 0.1], ["muffin", 0.3], ["chocolate", 0.5]]
mylist2 = [["chocolate", 0.5], ["milk", 0.2], ["carrot", 0.8], ["egg", 0.8]]

I want to get the mean of the common elements in the two lists as follows.

myoutput = [["chocolate", 0.5], ["egg", 0.45]]

My current code is as follows:

import numpy as np

for item1 in mylist1:
    for item2 in mylist2:
        if item1[0] == item2[0]:
            print(np.mean([item1[1], item2[1]]))

However, since there are two nested for loops (O(n^2) complexity), this is very inefficient for very long lists. I am wondering if there is a more standard/efficient way of doing this in Python.

EmJ
  • Converting the items to a dictionary would make for a readable and Pythonic solution – moo May 10 '20 at 13:18
  • Perhaps you can get the means of each list and then just do something like `mean1/len(mylist1) + mean2/len(mylist2)`, as this would get you the mean of the lists combined – dodekja May 11 '20 at 08:37
  • Is it a given that each key is always *unique* to each list? I think many of the answers make this assumption. It's probably a valid assumption, but just wanted to be sure in case this was just dummy data, and the "real" data might have duplicates (it's not entirely clear to me what these key,value pairs represent). – Roberto May 11 '20 at 12:53
  • `(mydict1[i] + mydict2[i])/2 for i in set(mydict1) & set(mydict2)` # after dict() – benxyzzy May 11 '20 at 15:07
  • Thank you for the suggestions :) – EmJ May 13 '20 at 23:28

9 Answers

37

You can do it in O(n) (a single pass over each list) by converting mylist1 to a dict, then looking each item of the 2nd list up in that dict (an O(1) operation), like this:

mylist1 = [["lemon", 0.1], ["egg", 0.1], ["muffin", 0.3], ["chocolate", 0.5]]
mylist2 = [["chocolate", 0.5], ["milk", 0.2], ["carrot", 0.8], ["egg", 0.8]]

l1_as_dict = dict(mylist1)  # one O(n) pass over mylist1

myoutput = []
for item, price2 in mylist2:        # one pass over mylist2
    if item in l1_as_dict:          # O(1) average-case lookup
        price1 = l1_as_dict[item]
        myoutput.append([item, (price1 + price2) / 2])

print(myoutput)

Output:

[['chocolate', 0.5], ['egg', 0.45]]
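
If you prefer a one-liner, the same lookup idea fits in a list comprehension (a minimal sketch, equivalent to the loop above):

l1_as_dict = dict(mylist1)
myoutput = [[item, (l1_as_dict[item] + price2) / 2]
            for item, price2 in mylist2
            if item in l1_as_dict]
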
Adam.Er8
17

An O(n) solution that will average all items.
Construct a dictionary mapping each key to a list of its values, then average those lists afterwards:

In []:
d = {}
for lst in (mylist1, mylist2):
    for i, v in lst:
        d.setdefault(i, []).append(v)   # alternatively, use collections.defaultdict

[(k, sum(v)/len(v)) for k, v in d.items()]

Out[]:
[('lemon', 0.1), ('egg', 0.45), ('muffin', 0.3), ('chocolate', 0.5), ('milk', 0.2), ('carrot', 0.8)]

Then, if you just want the common ones, you can add a guard:

In []:
[(k, sum(v)/len(v)) for k, v in d.items() if len(v) > 1]

Out[]:
[('egg', 0.45), ('chocolate', 0.5)]

This extends to any number of lists and makes no assumptions about the number of common elements.
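
For example, continuing the session above with a hypothetical third list, the same setdefault loop works unchanged; tighten the guard to len(v) == 3 to keep only items present in all three lists:

In []:
mylist3 = [["egg", 0.6], ["milk", 0.4]]   # hypothetical third list
for i, v in mylist3:
    d.setdefault(i, []).append(v)

[(k, sum(v)/len(v)) for k, v in d.items() if len(v) == 3]

Out[]:
[('egg', 0.5)]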

AChampion
9

Here is one solution that uses collections.defaultdict to group the items and calculates the averages with statistics.mean:

from collections import defaultdict
from statistics import mean

mylist1 = [["lemon", 0.1], ["egg", 0.1], ["muffin", 0.3], ["chocolate", 0.5]]
mylist2 = [["chocolate", 0.5], ["milk", 0.2], ["carrot", 0.8], ["egg", 0.8]]

d = defaultdict(list)
for lst in (mylist1, mylist2):
    for k, v in lst:
        d[k].append(v)

result = [[k, mean(v)] for k, v in d.items()]

print(result)
# [['lemon', 0.1], ['egg', 0.45], ['muffin', 0.3], ['chocolate', 0.5], ['milk', 0.2], ['carrot', 0.8]]

If we only want common keys, just check whether more than one value was collected:

result = [[k, mean(v)] for k, v in d.items() if len(v) > 1]

print(result)
# [['egg', 0.45], ['chocolate', 0.5]]

We could also just build the result from set intersection:

mylist1 = [["lemon", 0.1], ["egg", 0.1], ["muffin", 0.3], ["chocolate", 0.5]]
mylist2 = [["chocolate", 0.5], ["milk", 0.2], ["carrot", 0.8], ["egg", 0.8]]

d1, d2 = dict(mylist1), dict(mylist2)

result = [[k, (d1[k] + d2[k]) / 2] for k in d1.keys() & d2.keys()]

print(result)
# [['egg', 0.45], ['chocolate', 0.5]]
RoadRunner
8

You can use the Pandas library to avoid writing any loops yourself; the code stays really concise and clean.

Install Pandas with pip install pandas, then try this:

In [132]: import pandas as pd

In [109]: df1 = pd.DataFrame(mylist1)

In [110]: df2 = pd.DataFrame(mylist2)

In [117]: res = pd.merge(df1, df2, on=0)

In [121]: res['mean'] = res.mean(axis=1, numeric_only=True)

In [125]: res.drop(['1_x', '1_y'], axis=1, inplace=True)

In [131]: res.values.tolist()
Out[131]: [['egg', 0.45], ['chocolate', 0.5]]

Edit

Pandas is crazy fast because it uses NumPy under the hood, and NumPy implements highly efficient array operations.

Check the post Why is Pandas so madly fast? for more details on computing a mean in pure Python vs. Pandas.
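
If you'd rather stay in Pandas end to end, a groupby-based sketch also works and generalizes to duplicate keys (assuming the same mylist1 and mylist2):

import pandas as pd

# Stack both lists into one frame.
df = pd.DataFrame(mylist1 + mylist2, columns=["item", "value"])

# Keep only items that occur in both lists (each appears exactly twice here),
# then average per item.
common = df.groupby("item").filter(lambda g: len(g) == 2)
result = common.groupby("item", as_index=False)["value"].mean()

print(result.values.tolist())
# [['chocolate', 0.5], ['egg', 0.45]]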

Mayank Porwal
7

To easily manipulate your values, I'd suggest converting to dicts, finding the common keys, and computing the mean:

mylist1 = [["lemon", 0.1], ["egg", 0.1], ["muffin", 0.3], ["chocolate", 0.5]]
mylist2 = [["chocolate", 0.5], ["milk", 0.2], ["carrot", 0.8], ["egg", 0.8]]

recipe_1 = dict(mylist1)  # {'lemon': 0.1, 'egg': 0.1, 'muffin': 0.3, 'chocolate': 0.5}
recipe_2 = dict(mylist2)  # {'chocolate': 0.5, 'milk': 0.2, 'carrot': 0.8, 'egg': 0.8}

common_keys = recipe_1.keys() & recipe_2.keys()  # {'chocolate', 'egg'}

import numpy as np  # only needed for the np.mean variant

myoutput = [[item, np.mean((recipe_1[item], recipe_2[item]))] for item in common_keys]
# or, without NumPy:
myoutput = [[item, (recipe_1[item] + recipe_2[item]) / 2] for item in common_keys]
azro
6

Convert the lists to dicts, then iterate one dict and look each key up in the other:

d_list1 = dict(mylist1)
d_list2 = dict(mylist2)

[[k, (v+d_list2[k])/2] for k, v in d_list1.items() if k in d_list2]
#[['egg', 0.45], ['chocolate', 0.5]]
Transhuman
6

You can get the common keys of the two lists using the set intersection method, and then calculate the means in a list comprehension:

mylist1 = [["lemon", 0.1], ["egg", 0.1], ["muffin", 0.3], ["chocolate", 0.5]]
mylist2 = [["chocolate", 0.5], ["milk", 0.2], ["carrot", 0.8], ["egg", 0.8]]

dict1 = dict(mylist1)
dict2 = dict(mylist2)
res = [[key, (dict1[key] + dict2[key]) / 2]
       for key in set(dict1).intersection(dict2)]
print(res)

Output:

>> [['chocolate', 0.5], ['egg', 0.45]]
Fullstack Guy
5

You can do it in the time required for computing the set intersection, which is O(min(N1, N2)), where N1 and N2 are the list lengths (building the sets and dicts adds an O(N1 + N2) pass; see the comments below):

intersect = {a[0] for a in mylist1}.intersection(a[0] for a in mylist2)
d1 = dict(mylist1)
d2 = dict(mylist2)
result = {i: (d1[i] + d2[i]) / 2 for i in intersect}
print(result)  # {'chocolate': 0.5, 'egg': 0.45} (order may vary)
jeremy_rutman
  • 1
    I would be careful in how you describe the complexity. You have 2 list comprehension and 3 dictionary constructions. The set intersection is only one of your operations. Having said all that it is still only `O(n)`. – AChampion May 11 '20 at 04:32
  • *"Comuting"* is not an English word. Do you mean *[computing](https://en.wiktionary.org/wiki/computing#Verb)* or *[commuting](https://en.wiktionary.org/wiki/commute#Verb)* (*"2. (intransitive, mathematics) Of an operation, to be commutative, i.e. to have the property that changing the order of the operands does not change the result."*)? – Peter Mortensen May 11 '20 at 13:38
  • I don't think commuting the set intersections will make any difference, as the intersection operation already takes into account the shorter-length list and does no more operations than required, so it wouldn't matter if you did a.intersection(b) or b.intersection(a). However, this answer did commute [reduce to lesser length] the "time served" from O(n1*n2) to O(min(n1,n2)), so there's that to consider on the long drive to work. – jeremy_rutman May 11 '20 at 16:18
2

Here is a simple, very Pythonic solution:

result = [[x[0], (x[1] + y[1])/2] for x in mylist1 for y in mylist2 if x[0] == y[0]]

It probably is not the fastest solution (it is still O(n^2)), but it is faster than the OP's version by virtue of using a list comprehension to iterate the lists, and, since neither this solution nor the OP's handles multiple instances of a key, it replaces np.mean with a simple average of the two values.
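
To see the difference in practice, here is a rough timing sketch (the list contents and sizes are arbitrary; linear() mirrors the dict-based accepted answer):

import random
import timeit

n = 1000
lst1 = [[f"item{i}", random.random()] for i in random.sample(range(2 * n), n)]
lst2 = [[f"item{i}", random.random()] for i in random.sample(range(2 * n), n)]

def quadratic():
    return [[x[0], (x[1] + y[1]) / 2] for x in lst1 for y in lst2 if x[0] == y[0]]

def linear():
    d1 = dict(lst1)
    return [[k, (d1[k] + v) / 2] for k, v in lst2 if k in d1]

print(timeit.timeit(quadratic, number=10))  # time grows with n * n
print(timeit.timeit(linear, number=10))     # time grows with n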

Mark