
I have two lists, one with IDs and one with the values corresponding to those IDs. Sometimes the first list contains the same ID multiple times. For example:

list_of_IDs    = [0, 0, 0, 1, 1, 1, 1, 2, 3, 3, ..., 100, 100,100]
list_of_values = [0, 1, 2, 0, 1, 2, 3, 0, 1, 2, ...,   8,   9, 10]

I want to combine these lists into a dictionary (or similar) such that I have the list of values for each ID:

dict_of_vals_for_IDs = {0: [0,1,2], 1:[0,1,2,3], 2:[0], 3:[1,2], ...,100:[8,9,10]}

What is the quickest and most efficient way to do this?

Edit: I have tried a number of solutions that loop over each unique value in list_of_IDs and find all indices corresponding to each ID. However, these solutions seem to be quite slow since my actual lists are millions of elements long. Is there a quicker solution than one that uses loops or list comprehensions?
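The approaches I tried were roughly of this shape (a simplified sketch, not the exact code):

# Simplified sketch of the slow approach: for each unique ID, scan the whole
# ID list to find its indices, then gather the corresponding values.
# This is roughly O(n * k) for n elements and k unique IDs.
dict_of_vals_for_IDs = {
    uid: [list_of_values[i] for i, x in enumerate(list_of_IDs) if x == uid]
    for uid in set(list_of_IDs)
}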

jruss

3 Answers


You could initialize a dict with an empty list for every unique ID, then iterate over the paired IDs and values and append each value to its ID's list.

dict_of_vals_for_IDs = {i: [] for i in set(list_of_IDs)}
for k, v in zip(list_of_IDs, list_of_values):
    dict_of_vals_for_IDs[k].append(v)
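
With the question's example lists filled in (the same dummy data used in the pandas answer below), this gives the expected mapping:

# Result for the dummy data:
# {0: [0, 1, 2], 1: [0, 1, 2, 3], 2: [0], 3: [1, 2], 100: [8, 9, 10]}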
Kevin Sheng
  • This ended up being the fastest solution of the three current options: balderman: 5.25s. CypherX: 125.15s. Kevin Sheng: 4.77s – jruss Oct 06 '20 at 23:01

Solution

Apart from what others have suggested, you could also do this with the pandas library. It may not be worth it if you otherwise have no need for pandas, but given how often pandas is already available, here is an option.

The benchmarking section below shows that for a list length of 1e8 (i.e. 100 million), the wall time for execution is about 23 seconds; for one million records it is around 350 ms.

import pandas as pd

df = pd.DataFrame({'ID': list_of_IDs, 'VALUE': list_of_values})
df.groupby('ID')['VALUE'].apply(list).to_dict()

## Output: 
# {0: [0, 1, 2], 1: [0, 1, 2, 3], 2: [0], 3: [1, 2], 100: [8, 9, 10]}

Single line solution

You could equivalently chain all of the operations together and get what you want in a single statement (broken across multiple lines for readability), as follows.

pd.DataFrame({
    'ID': list_of_IDs, 
    'VALUE': list_of_values
}).groupby('ID')['VALUE'].apply(list).to_dict()

## Output: 
# {0: [0, 1, 2], 1: [0, 1, 2, 3], 2: [0], 3: [1, 2], 100: [8, 9, 10]}

Dummy Data

list_of_IDs    = [0, 0, 0, 1, 1, 1, 1, 2, 3, 3, 100, 100, 100]
list_of_values = [0, 1, 2, 0, 1, 2, 3, 0, 1, 2, 8, 9, 10]

Benchmarking Speed of Execution

Let us first prepare some dummy data (N = 1e4, 1e5, 1e6, 1e7, 1e8) and then use it to benchmark the execution (wall) time. For a list length of 1e8 (i.e. 100 million) the wall time is about 23 seconds; for one million records it is around 350 ms.

import pandas as pd
import numpy as np

N = int(1e8)  # list length; change N to reproduce the other rows of the table below
ids = np.random.randint(low=0, high=100, size=N)
values = np.random.randint(low=0, high=10, size=N)

I used a Google Colab notebook and ran the following in a cell, with the data prepared above.

# run in a notebook
%%time 
d = pd.DataFrame({
    'ID': ids, 
    'VALUE': values
}).groupby('ID')['VALUE'].apply(list).to_dict()

Output:

|  N  | CPU Times (user, sys, total) | Wall Time |
|-----|------------------------------|-----------|
| 1e8 | (18.5  s, 4.38  s, 22.9  s)  | 23  s     |
| 1e7 | (2.19  s, 212  ms, 2.4   s)  | 2.4  s    |
| 1e6 | (317  ms, 31.6 ms, 348  ms)  | 352 ms    |
| 1e5 | (28.5 ms, 10.9 ms, 39.4 ms)  | 42  ms    |
| 1e4 | (12.6 ms, 2.92 ms, 15.5 ms)  | 18  ms    |


CypherX
  • @jruss Did you see the benchmarking section? I think you mentioned `speed` is a concern for millions of rows of data. I compared `1`, `10` and `100` **million** rows of data. – CypherX Oct 06 '20 at 21:43
  • I like the use of pandas, especially in the one-line solution but it actually ended up being significantly slower than the other solutions. See the accepted solution for my benchmarks. – jruss Oct 06 '20 at 22:59

See below, using collections.defaultdict to accumulate the values per ID.

from collections import defaultdict

ids    = [0, 0, 0, 1, 1, 1, 1, 2, 3, 3]
values = [0, 1, 2, 0, 1, 2, 3, 0, 1, 2]

# defaultdict(list) creates an empty list the first time a new ID is seen,
# so each value can simply be appended to its ID's list.
d = defaultdict(list)
for idx, x in enumerate(ids):
    d[x].append(values[idx])
print(d)

output

defaultdict(<class 'list'>, {0: [0, 1, 2], 1: [0, 1, 2, 3], 2: [0], 3: [1, 2]})
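
If you prefer a plain dict for the final result, convert it once everything has been appended:

# defaultdict is a dict subclass, so a one-step conversion gives a regular dict
dict_of_vals_for_IDs = dict(d)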
balderman