
I have two lists, one with IDs and one with the values corresponding to those IDs. Sometimes the first list contains the same ID multiple times. For example:

list_of_IDs    = [0, 0, 0, 1, 1, 1, 1, 2, 3, 3, ..., 100, 100,100]
list_of_values = [0, 1, 2, 0, 1, 2, 3, 0, 1, 2, ...,   8,   9, 10]

I want to combine these lists into a dictionary (or similar) such that I have the list of values for each ID:

dict_of_vals_for_IDs = {0: [0,1,2], 1:[0,1,2,3], 2:[0], 3:[1,2], ...,100:[8,9,10]}

What is the quickest and most efficient way to do this?

Edit: I have tried a number of solutions that loop over each unique value in list_of_IDs and find all indices corresponding to each ID. However, these solutions seem to be quite slow since my actual lists are millions of elements long. Is there a quicker solution than one that uses loops or list comprehensions?
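The approaches I tried were roughly of this shape (a simplified sketch, not the exact code):

# Simplified sketch of the slow approach: for each unique ID, scan the whole
# ID list to find its indices, then gather the corresponding values.
# This is roughly O(n * k) for n elements and k unique IDs.
dict_of_vals_for_IDs = {
    uid: [list_of_values[i] for i, x in enumerate(list_of_IDs) if x == uid]
    for uid in set(list_of_IDs)
}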

jruss

3 Answers


You could initialize a dict with an empty list for every unique ID, then iterate over the paired IDs and values and append each value to its ID's list.

dict_of_vals_for_IDs = {i: [] for i in set(list_of_IDs)}
for k, v in zip(list_of_IDs, list_of_values):
    dict_of_vals_for_IDs[k].append(v)
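
With the question's example lists filled in (the same dummy data used in the pandas answer below), this gives the expected mapping:

# Result for the dummy data:
# {0: [0, 1, 2], 1: [0, 1, 2, 3], 2: [0], 3: [1, 2], 100: [8, 9, 10]}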
Kevin Sheng
  • This ended up being the fastest solution of the three current options: balderman: 5.25s. CypherX: 125.15s. Kevin Sheng: 4.77s – jruss Oct 06 '20 at 23:01

Solution

Apart from what others have suggested, you could also do this with the pandas library. It may not be worth it if you otherwise have no need for pandas, but given how often pandas is already available, here is an option.

The benchmarking section below shows that for a list length of 1e8 (i.e. 100 million), the wall time for execution is about 23 seconds; for one million records it is around 350 ms.

import pandas as pd

df = pd.DataFrame({'ID': list_of_IDs, 'VALUE': list_of_values})
df.groupby('ID')['VALUE'].apply(list).to_dict()

## Output: 
# {0: [0, 1, 2], 1: [0, 1, 2, 3], 2: [0], 3: [1, 2], 100: [8, 9, 10]}

Single line solution

You could equivalently chain all of the operations together and get what you want in a single statement (broken across multiple lines for readability), as follows.

pd.DataFrame({
    'ID': list_of_IDs, 
    'VALUE': list_of_values
}).groupby('ID')['VALUE'].apply(list).to_dict()

## Output: 
# {0: [0, 1, 2], 1: [0, 1, 2, 3], 2: [0], 3: [1, 2], 100: [8, 9, 10]}

Dummy Data

list_of_IDs    = [0, 0, 0, 1, 1, 1, 1, 2, 3, 3, 100, 100, 100]
list_of_values = [0, 1, 2, 0, 1, 2, 3, 0, 1, 2, 8, 9, 10]

Benchmarking Speed of Execution

Let us first prepare some dummy data (N = 1e4, 1e5, 1e6, 1e7, 1e8) and then use it to benchmark the execution (wall) time. For a list length of 1e8 (i.e. 100 million) the wall time is about 23 seconds; for one million records it is around 350 ms.

import pandas as pd
import numpy as np

N = int(1e8)  # list length; change N to reproduce the other rows of the table below
ids = np.random.randint(low=0, high=100, size=N)
values = np.random.randint(low=0, high=10, size=N)

I used a Google Colab notebook and ran the following in a cell, with the data prepared above.

# run in a notebook
%%time 
d = pd.DataFrame({
    'ID': ids, 
    'VALUE': values
}).groupby('ID')['VALUE'].apply(list).to_dict()

Output:

|  N  | CPU Times (user, sys, total) | Wall Time |
|-----|------------------------------|-----------|
| 1e8 | (18.5  s, 4.38  s, 22.9  s)  | 23  s     |
| 1e7 | (2.19  s, 212  ms, 2.4   s)  | 2.4  s    |
| 1e6 | (317  ms, 31.6 ms, 348  ms)  | 352 ms    |
| 1e5 | (28.5 ms, 10.9 ms, 39.4 ms)  | 42  ms    |
| 1e4 | (12.6 ms, 2.92 ms, 15.5 ms)  | 18  ms    |


CypherX
  • @jruss Did you see the benchmarking section? I think you mentioned `speed` is a concern for millions of rows of data. I compared `1`, `10` and `100` **million** rows of data. – CypherX Oct 06 '20 at 21:43
  • I like the use of pandas, especially in the one-line solution but it actually ended up being significantly slower than the other solutions. See the accepted solution for my benchmarks. – jruss Oct 06 '20 at 22:59

See below, using collections.defaultdict to accumulate the values per ID.

from collections import defaultdict

ids    = [0, 0, 0, 1, 1, 1, 1, 2, 3, 3]
values = [0, 1, 2, 0, 1, 2, 3, 0, 1, 2]

# defaultdict(list) creates an empty list the first time a new ID is seen,
# so each value can simply be appended to its ID's list.
d = defaultdict(list)
for idx, x in enumerate(ids):
    d[x].append(values[idx])
print(d)

output

defaultdict(<class 'list'>, {0: [0, 1, 2], 1: [0, 1, 2, 3], 2: [0], 3: [1, 2]})
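
If you prefer a plain dict for the final result, convert it once everything has been appended:

# defaultdict is a dict subclass, so a one-step conversion gives a regular dict
dict_of_vals_for_IDs = dict(d)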
balderman