Solution
Apart from what others have suggested as answers, you could also do this using pandas
library. This may not be optimal if you have very little need for using pandas
. But given the many scenarios where one could use pandas
, here is an option.
The benchmarking section below shows that a typical list length of 1e8
(i.e 100 Million), the wall time for execution is about 23 seconds
. For one million records the execution time is around 350 ms
.
import pandas as pd
df = pd.DataFrame({'ID': list_of_IDs, 'VALUE': list_of_values})
df.groupby('ID')['VALUE'].apply(list).to_dict()
## Output:
# {0: [0, 1, 2], 1: [0, 1, 2, 3], 2: [0], 3: [1, 2], 100: [8, 9, 10]}
Single line solution
You could equivalently write all the operations as a chain of operations and get what you want in a single line (broken into multiple lines for enhanced readability) as follows.
pd.DataFrame({
'ID': list_of_IDs,
'VALUE': list_of_values
}).groupby('ID')['VALUE'].apply(list).to_dict()
## Output:
# {0: [0, 1, 2], 1: [0, 1, 2, 3], 2: [0], 3: [1, 2], 100: [8, 9, 10]}
Dummy Data
list_of_IDs = [0, 0, 0, 1, 1, 1, 1, 2, 3, 3, 100, 100, 100]
list_of_values = [0, 1, 2, 0, 1, 2, 3, 0, 1, 2, 8, 9, 10]
Benchmarking Speed of Execution
Let us first prepare some dummy data (N = 1e4, 1e5, 1e6, 1e7, 1e8
) and then use this data to benchmark the execution (wall) time. It shows that a typical list length of 1e8
(i.e 100 Million), the wall time for execution is about 23 seconds. For one million records the execution time is around 350 ms
.
import pandas as pd
import numpy as np
N = 1e8
N = int(N)
ids = np.random.randint(low=0, high=100, size=N)
values = np.random.randint(low=0, high=10, size=N)
I used a Google Colab notebook and rand the following in a cell with the data prepared in above.
# run in a notebook
%%time
d = pd.DataFrame({
'ID': ids,
'VALUE': values
}).groupby('ID')['VALUE'].apply(list).to_dict()
Output:
| N | CPU Times (user, sys, total) | Wall Time |
|-----|------------------------------|-----------|
| 1e8 | (18.5 s, 4.38 s, 22.9 s) | 23 s |
| 1e7 | (2.19 s, 212 ms, 2.4 s) | 2.4 s |
| 1e6 | (317 ms, 31.6 ms, 348 ms) | 352 ms |
| 1e5 | (28.5 ms, 10.9 ms, 39.4 ms) | 42 ms |
| 1e4 | (12.6 ms, 2.92 ms, 15.5 ms) | 18 ms |
Refrences