Combining three lists in Python with sorting

Question

How to efficiently and smartly combine 3 lists in the way like below?

 sex = ['M', 'M', 'F', 'F', 'M', 'F', 'M', 'F', 'F', 'F']
 actresses = ['Natalie Portman', 'Anne Hathaway', 'Talia Shire', 'Diane Keaton', 'Keira Knightley', 'Uma Thurman']
 actors = ['Morgan Freeman', 'Leonardo DiCaprio', 'Robert De Niro', 'Brad Pitt']

Result:

[('M', 'Morgan Freeman'),
 ('M', 'Leonardo DiCaprio'),
 ('F', 'Natalie Portman'),
 ('F', 'Anne Hathaway'),
 ('M', 'Robert De Niro'),
 ('F', 'Talia Shire'),
 ('M', 'Brad Pitt'),
 ('F', 'Diane Keaton'),
 ('F', 'Keira Knightley'),
 ('F', 'Uma Thurman')]

My solution:

sex = ['M', 'M', 'F', 'F', 'M', 'F', 'M', 'F', 'F', 'F']
actresses = ['Natalie Portman', 'Anne Hathaway', 'Talia Shire', 'Diane Keaton', 'Keira Knightley', 'Uma Thurman', ]
actors = ['Morgan Freeman', 'Leonardo DiCaprio', 'Robert De Niro', 'Brad Pitt']
result = []

for s in sex:
    if s == 'F':
        result.append((s, actresses.pop(0)))
    elif s == 'M':
        result.append((s, actors.pop(0)))

print(f'result = {result}')

What is the best way for a long lists (e.g. 1 million items)?

Have you considered using pandas DataFrames? You could simply store the actresses into a dataframe, the actors into another dataframe, add a gender column to each dataframe (`df_actors['gender'] = 'M'` and `df_actresses['gender'] = 'F'`), and then merge the dataframes. — bug_spray, May 16 '20 at 14:45
I think there is no better way. You have a O(N) algorithm, and for the problem's nature you have to check the sex array one by one. — João Victor, May 16 '20 at 14:45
`the best way` - I imagine the **best** way is one that works. — wwii, May 16 '20 at 16:00

gboffi · Answer 1 · 2020-05-16T18:54:41.130

You can place references to the lists in a dictionary and do a list comprehension

In [8]: sexes = ['M', 'M', 'F', 'F', 'M', 'F', 'M', 'F', 'F', 'F'] 
   ...: actresses = ['Natalie Portman', 'Anne Hathaway', 'Talia Shire', 'Diane Keaton', 'Keira Knightley', 'Uma Thurman', ] 
   ...: actors = ['Morgan Freeman', 'Leonardo DiCaprio', 'Robert De Niro', 'Brad Pitt']
   ...: 
   ...: mf = {'M':iter(actors), 'F':iter(actresses)} 
   ...: [(sex, next(mf[sex])) for sex in sexes]                                                                                                 
Out[8]: 
[('M', 'Morgan Freeman'),
 ('M', 'Leonardo DiCaprio'),
 ('F', 'Natalie Portman'),
 ('F', 'Anne Hathaway'),
 ('M', 'Robert De Niro'),
 ('F', 'Talia Shire'),
 ('M', 'Brad Pitt'),
 ('F', 'Diane Keaton'),
 ('F', 'Keira Knightley'),
 ('F', 'Uma Thurman')]

In [9]:

If your list are longish and you are going to consume one pair sex-person at once you can use a generator expression in place of the list comprehension

pairs = ((sex, next(mf[s])) for sex in sexes)
for sex, person in pairs:
    ...

or possibly even simpler

for sex in sexes:
    person =  next(mf[sex])
    ...

If your lists were stored on disk you can use the same pattern introduced above but using generator expressions in place of lists

mf = {'M':(line.strip() for line in open('male_performers.txt'), 
      'F':(line.strip() for line in open('female_performers.txt')}
sexes = (line.strip() for line in open('sexes.txt'))

for sex in sexes:
     performer = next(mf[sex])

Asocia · Accepted Answer · 2020-05-16T15:03:35.117

You are popping from starting of the list which has time complexity of O(N). What you could do instead is keep an index for both actors and actresses lists and increment them in the loop.

sex = ['M', 'M', 'F', 'F', 'M', 'F', 'M', 'F', 'F', 'F']
actresses = ['Natalie Portman', 'Anne Hathaway', 'Talia Shire', 'Diane Keaton', 'Keira Knightley', 'Uma Thurman', ]
actors = ['Morgan Freeman', 'Leonardo DiCaprio', 'Robert De Niro', 'Brad Pitt']
result = []

actors_i = 0
actresses_i = 0

for s in sex:
    if s == 'F':
        result.append((s, actresses[actresses_i]))
        actresses_i += 1
    elif s == 'M':
        result.append((s, actors[actors_i]))
        actors_i += 1

print(f'result = {result}')

After this point, I don't think there are any improvements left other than making your code more readable because you have to go over every item in the sex list and you are using operations which has cost of O(1) in the loop. So the complexity is O(N).

bug_spray · Answer 3 · 2020-05-16T15:11:34.103

Given that all actors have a label of 'M' and all actresses have a label of 'F', you could use pandas to group the information in a way that should have faster performance than looping through large lists.

Here is an example:

import pandas as pd

actresses = ['Natalie Portman', 'Anne Hathaway', 'Talia Shire', 'Diane Keaton', 'Keira Knightley', 'Uma Thurman', ]
actors = ['Morgan Freeman', 'Leonardo DiCaprio', 'Robert De Niro', 'Brad Pitt']

df_actresses = pd.DataFrame(actresses, columns=['name'])
df_actors = pd.DataFrame(actors, columns=['name'])

df_actresses['sex'] = 'F'
df_actors['sex'] = 'M'

df = pd.concat([df_actresses, df_actors], axis=0)

# if you really need it to be a list
result = df.values.tolist()

score 1 · Answer 4 · answered May 21 '20 at 06:50

Thank you for all the answers. Yes, using pop(0) was a very bad idea in this case. I tried to compare all solutions for 1 million pseudo items. In my opinion the results were very good except for the use of pop(0).

Results:

combine_with_pop     Items = 1000000. Average time: 45.49504270553589 secs
combine_without_pop  Items = 1000000. Average time:  0.33301634788513185 secs
combine_dict         Items = 1000000. Average time:  0.21431212425231932 secs
combine_generator    Items = 1000000. Average time:  0.2770370960235596 secs
combine_frames       Items = 1000000. Average time:  0.06862187385559082 secs

Test:

import pandas as pd
import string
import random
import time
import inspect
from statistics import mean

result_size = 1000000
g_number_of_repetitions = 5


def init():
    # Generate sexes
    population = ('M', 'F')
    male_weight = 0.48
    weights = (0.4, 1 - male_weight)
    actresses = []
    actors = []
    sexes = random.choices(population, weights, k=result_size)
    male_amount = sexes.count('M')
    female_amount = result_size - male_amount

    # Generate pseudo 'actresses' and 'actors'
    act_len = 20
    for a in range(female_amount):
        actresses.append(''.join(random.choices(string.ascii_lowercase, k=act_len)))
    for a in range(male_amount):
        actors.append(''.join(random.choices(string.ascii_lowercase, k=act_len)))
    return sexes, actresses, actors


def combine_with_pop(number_of_repetitions, sexes, random_actresses, random_actors):
    time_measurements = []
    for i in range(number_of_repetitions):
        actors = random_actors[:]
        actresses = random_actresses[:]
        result = []
        t0 = time.time()
        for s in sexes:
            if s == 'F':
                result.append((s, actresses.pop(0)))
            elif s == 'M':
                result.append((s, actors.pop(0)))
        time_one_round = time.time() - t0
        time_measurements.append(time_one_round)
    print(
        f'{inspect.currentframe().f_code.co_name.ljust(20)} '
        f'Items = {result_size}. Average time: {str(mean(time_measurements))} secs')


def combine_without_pop(number_of_repetitions, sexes, random_actresses, random_actors):
    time_measurements = []
    for i in range(number_of_repetitions):
        actors = random_actors[:]
        actresses = random_actresses[:]
        result = []
        actors_i = 0
        actresses_i = 0
        t0 = time.time()
        for s in sexes:
            if s == 'F':
                result.append((s, actresses[actresses_i]))
                actresses_i += 1
            elif s == 'M':
                result.append((s, actors[actors_i]))
                actors_i += 1
        time_one_round = time.time() - t0
        time_measurements.append(time_one_round)

    print(
        f'{inspect.currentframe().f_code.co_name.ljust(20)} '
        f'Items = {result_size}. Average time: {str(mean(time_measurements))} secs')


def combine_dict(number_of_repetitions, sexes, random_actresses, random_actors):
    time_measurements = []
    for i in range(number_of_repetitions):
        actors = random_actors[:]
        actresses = random_actresses[:]
        result = []
        t0 = time.time()

        mf = {'M': iter(actors), 'F': iter(actresses)}
        result = [(sex, next(mf[sex])) for sex in sexes]
        time_one_round = time.time() - t0
        time_measurements.append(time_one_round)

    print(
        f'{inspect.currentframe().f_code.co_name.ljust(20)} '
        f'Items = {result_size}. Average time: {str(mean(time_measurements))} secs')


def combine_generator(number_of_repetitions, sexes, random_actresses, random_actors):
    time_measurements = []
    for i in range(number_of_repetitions):
        actors = random_actors[:]
        actresses = random_actresses[:]
        result = []
        t0 = time.time()

        mf = {'M': iter(actors), 'F': iter(actresses)}
        for sex in sexes:
            person = next(mf[sex])
            result.append((sex, person))

        time_one_round = time.time() - t0
        time_measurements.append(time_one_round)

    print(
        f'{inspect.currentframe().f_code.co_name.ljust(20)} '
        f'Items = {result_size}. Average time: {str(mean(time_measurements))} secs')


def combine_frames(number_of_repetitions, sexes, random_actresses, random_actors):
    time_measurements = []
    for i in range(number_of_repetitions):
        actors = random_actors[:]
        actresses = random_actresses[:]
        result = []
        df_actresses = pd.DataFrame(actresses, columns=['name'])
        df_actors = pd.DataFrame(actors, columns=['name'])

        t0 = time.time()

        df_actresses['sex'] = 'F'
        df_actors['sex'] = 'M'

        df = pd.concat([df_actresses, df_actors], axis=0)

        # if you really need it to be a list
        # result = df.values.tolist()

        time_one_round = time.time() - t0
        time_measurements.append(time_one_round)

    print(
        f'{inspect.currentframe().f_code.co_name.ljust(20)} '
        f'Items = {result_size}. Average time: {str(mean(time_measurements))} secs')


g_sexes, g_actresses, g_actors = init()
combine_with_pop(g_number_of_repetitions, g_sexes, g_actresses, g_actors)
combine_without_pop(g_number_of_repetitions, g_sexes, g_actresses, g_actors)
combine_dict(g_number_of_repetitions, g_sexes, g_actresses, g_actors)
combine_generator(g_number_of_repetitions, g_sexes, g_actresses, g_actors)
combine_frames(g_number_of_repetitions, g_sexes, g_actresses, g_actors)

Combining three lists in Python with sorting

4 Answers4