
First, please note that I'm using the word "lists" generically. I suspect the best solution will utilize some sort of collection such as an ordered dict, "ordered set," etc.

I'm representing a collection of objects using two lists--an object list and a count list. The object list contains the objects, each one being unique. The objects are represented by long ints. The second list aligns with the first, and its items are ints which indicate how many of each associated object in the object list are in the collection (a count, as in a histogram).

The possible number of unique objects is huge, as many as 2**64. In practice, however, there will be perhaps 20,000-30,000 unique objects in a collection.

Now, given two collections: collection A (represented by lists obj_A and cnt_A) and collection B (represented by lists obj_B and cnt_B), I need to combine/add them into a new collection, C. Thus, I need to find all the objects in A that are also in B, and sum the A and B count numbers for those objects. Objects that are only in A or only in B will retain the count numbers from their respective collections.

For example, in the lists below the object 750 is in both lists and thus the count for 750 in the combined collection, C, is the sum of the counts in the A and B collections.

ojb_A = [4903, 750, 29868, 833]
cnt_A = [1,    3,   24,    3  ]

ojb_B = [2357, 39,  750,   38 ]
cnt_B = [8,    52,  6,     2  ]

Combining A and B into C gives:

ojb_C = [4903, 750, 29868, 833, 2357, 39,  38]
cnt_C = [1,    9,   24,    3,   8,    52,  2 ]

As initially mentioned, I suspect that some ordered representation of the object lists will be needed for efficiency, though I did not show the items as ordered in my example above.
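
For concreteness, here is a rough sketch of the kind of ordered merge I have in mind, assuming both object lists are kept sorted (the function name and details are just illustrative):

def merge_sorted(obj_a, cnt_a, obj_b, cnt_b):
    # Two-pointer merge of two sorted (object, count) list pairs, O(len_A + len_B)
    obj_c, cnt_c = [], []
    i = j = 0
    while i < len(obj_a) and j < len(obj_b):
        if obj_a[i] == obj_b[j]:        # object in both collections: sum the counts
            obj_c.append(obj_a[i])
            cnt_c.append(cnt_a[i] + cnt_b[j])
            i += 1
            j += 1
        elif obj_a[i] < obj_b[j]:       # object only in A
            obj_c.append(obj_a[i])
            cnt_c.append(cnt_a[i])
            i += 1
        else:                           # object only in B
            obj_c.append(obj_b[j])
            cnt_c.append(cnt_b[j])
            j += 1
    # Append whatever remains from either list
    obj_c.extend(obj_a[i:]); cnt_c.extend(cnt_a[i:])
    obj_c.extend(obj_b[j:]); cnt_c.extend(cnt_b[j:])
    return obj_c, cnt_c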

[EDIT]: I just discovered that collections.Counter might serve my needs. But again, my collections have very large numbers of unique objects, so I'm looking for a highly efficient/fast solution.
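
Something like the following is what I have in mind with Counter, using the example lists above (whether it is fast enough at my sizes is the question):

from collections import Counter

counter_A = Counter(dict(zip(ojb_A, cnt_A)))   # works because objects are unique within a list
counter_B = Counter(dict(zip(ojb_B, cnt_B)))
counter_C = counter_A + counter_B              # sums counts for objects present in both
ojb_C, cnt_C = list(counter_C.keys()), list(counter_C.values())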

mattroos
  • Do you also need to maintain the order of elements? – Kapil Aug 06 '21 at 17:32
  • I don't need to maintain the order of the items/elements in the list. I simply suspect that ordered items will be critical for an efficient solution. – mattroos Aug 06 '21 at 17:36
  • Yes, it would take a little more runtime to keep them ordered like the original objects, but it is still achievable efficiently using pandas – Kapil Aug 06 '21 at 17:48
  • FYI, I think any initial sorting/ordering time could be ignored, assuming that ordered inputs result in ordered outputs. I'll be calling this operation thousands of times, so the time spent on the initial sorting should be negligible in comparison to the time spent combining the collections. – mattroos Aug 06 '21 at 18:13
  • @mattroos I have compared various methods and have also added a very competitive approach using numba. It might not be necessary for this, but you can think of numba when you need machine-code-like speed – eroot163pi Aug 06 '21 at 19:22
  • 1
    @eroot163pi I appreciate that deep examination of the speeds! I should clarify/reiterate that what I meant by 2^64 unique objects is that individual objects could be integers between 0 and 2^64-1. But individual collections likely have "only" 1e4 or 1e5 unique objects in them. So at the moment the "mammoth" sizes you explored are beyond my needs (that may change). Still, using numba might be useful, since I have to repeat the operation thousands of times and any speedup, however small, is amplified. Thank you! – mattroos Aug 06 '21 at 21:52
  • Just an idea: in that case, you can constrain the type from int64 to int16 in numba to make it even faster. But yeah, it will be speed vs. readability I guess – eroot163pi Aug 06 '21 at 21:55

4 Answers


You can use a defaultdict to accumulate all the counts:

from collections import defaultdict

ojb_A = [4903, 750, 29868, 833]
cnt_A = [1,    3,   24,    3  ]

ojb_B = [2357, 39,  750,   38 ]
cnt_B = [8,    52,  6,     2  ]

def count(out, ojb, cnt):
    # Add each object's count into the running totals
    for index, obj in enumerate(ojb):
        out[obj] += cnt[index]

def split_out(out):
    # Split the dict back into parallel object/count lists
    return list(out.keys()), list(out.values())

out = defaultdict(int)
count(out, ojb_A, cnt_A)
count(out, ojb_B, cnt_B)

ojb_C, cnt_C = split_out(out)
print(ojb_C, cnt_C)
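
If you also want the output ordered by object (as you suspected might help), you could sort the accumulated items before splitting; a minimal sketch:

def split_out_sorted(out):
    # Sort by object id, then unzip back into parallel lists
    items = sorted(out.items())
    return [k for k, _ in items], [v for _, v in items]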
quamrana

A combination of collections.Counter and zip is probably the most concise way and should also be fairly efficient.

from collections import Counter

counts = Counter(dict(zip(ojb_A, cnt_A)))   # build a Counter keyed by object
counts.update(dict(zip(ojb_B, cnt_B)))      # update() adds counts for keys already present

ojb_C, cnt_C = zip(*counts.items())         # unzip back into objects and counts

(FYI, ojb_C and cnt_C are tuples.)
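
If you need lists instead, you can convert them, for example:

ojb_C, cnt_C = list(ojb_C), list(cnt_C)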

fsimonjetz

You can do it efficiently using pandas:

import pandas as pd

df1 = pd.DataFrame(zip(ojb_A, cnt_A))
df2 = pd.DataFrame(zip(ojb_B, cnt_B))

# Stack both frames, then sum counts grouped by object id (column 0)
df_combined = pd.concat([df1, df2]).groupby(0).sum()
ojb_C = list(df_combined.index.values)
cnt_C = list(df_combined[1])

It takes just 303 ms ± 24.9 ms to combine two collections of ~100,000 unique objects each.
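
That figure presumably comes from something like IPython's %timeit; here is a rough stand-alone equivalent with synthetic inputs of ~100,000 unique objects each (the setup below is my assumption, not the exact benchmark):

import random
import timeit
import pandas as pd

N = 100_000
ojb_A = random.sample(range(2**30), N)
cnt_A = [random.randint(1, 100) for _ in range(N)]
ojb_B = random.sample(range(2**30), N)
cnt_B = [random.randint(1, 100) for _ in range(N)]

def combine():
    df1 = pd.DataFrame(zip(ojb_A, cnt_A))
    df2 = pd.DataFrame(zip(ojb_B, cnt_B))
    df_combined = pd.concat([df1, df2]).groupby(0).sum()
    return list(df_combined.index.values), list(df_combined[1])

print(timeit.timeit(combine, number=10) / 10, 'seconds per call')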

Kapil

Latest Update

I tweaked the code below to use numba; it is about 2x faster for larger inputs and can handle larger inputs without being killed on my local machine. numba compiles the function decorated with numba.njit to machine code, which shines when the inputs are huge. Subsequent runs are always faster than the first run due to compilation. I understand this might not be necessary for 10k-20k objects, but it might be needed closer to 2**64. (P.S. I use numba at work for speed; refer here to see how it can speed things up.)

Using numba

import numpy as np
import numba as nb
ojb_A = np.array(ojb_A)
ojb_B = np.array(ojb_B)
cnt_A = np.array(cnt_A)
cnt_B = np.array(cnt_B)
@nb.njit
def count(out, ojb, cnt):
    # Accumulate each object's count into the typed dict
    for index in range(ojb.shape[0]):
        out[ojb[index]] = out.get(ojb[index], 0) + cnt[index]

def split_out(out):
    return list(out.keys()), list(out.values())

out = nb.typed.Dict.empty(
    key_type=nb.core.types.int64,
    value_type=nb.core.types.int64,
)
count(out, ojb_A, cnt_A)
count(out, ojb_B, cnt_B)

ojb_C, cnt_C = split_out(out)

new_test.py

import random
import sys
import time
import numba as nb

random.seed(31212223)

MAX = 2 ** int(sys.argv[2])

def f3():
    ojb_A = random.sample(range(1, 2**30), MAX)
    cnt_A = random.sample(range(1, 2**30), MAX)
    ojb_B = random.sample(range(1, 2**30), MAX)
    cnt_B = random.sample(range(1, 2**30), MAX)
    s1 = time.time()
    from collections import defaultdict
    # @nb.jit
    def count(out, ojb, cnt):
        for index,obj in enumerate(ojb):
            out[obj] += cnt[index]
        
    def split_out(out):
        return list(out.keys()), list(out.values())

    out = defaultdict(int)
    count(out, ojb_A, cnt_A)
    count(out, ojb_B, cnt_B)

    ojb_C, cnt_C = split_out(out)
    # print(ojb_C, cnt_C)
    s2 = time.time()
    print('quamrana', s2 - s1)

def f3_1():
    ojb_A = random.sample(range(1, 2**30), MAX)
    cnt_A = random.sample(range(1, 2**30), MAX)
    ojb_B = random.sample(range(1, 2**30), MAX)
    cnt_B = random.sample(range(1, 2**30), MAX)
    s1 = time.time()
    import numpy as np
    import numba as nb
    ojb_A = np.array(ojb_A)
    ojb_B = np.array(ojb_B)
    cnt_A = np.array(cnt_A)
    cnt_B = np.array(cnt_B)
    @nb.njit
    def count(out, ojb, cnt):
        # Accumulate each object's count into the typed dict
        for index in range(ojb.shape[0]):
            out[ojb[index]] = out.get(ojb[index], 0) + cnt[index]

    def split_out(out):
        return list(out.keys()), list(out.values())

    out = nb.typed.Dict.empty(
        key_type=nb.core.types.int64,
        value_type=nb.core.types.int64,
    )
    count(out, ojb_A, cnt_A)
    count(out, ojb_B, cnt_B)

    ojb_C, cnt_C = split_out(out)
    # print(ojb_C, cnt_C)
    s2 = time.time()
    print('eroot163pi', s2 - s1)


if __name__ == '__main__':
    # Two runs to show subsequent run is faster
    eval(f'{sys.argv[1]}()')
    eval(f'{sys.argv[1]}()')

Running on mammoth cases 2^22-2^27

  • The numba process successfully finishes for 2^26, but normal Python is killed
  • Numba is consistently faster for these inputs and can handle larger inputs, though it too is killed at 2^27
(base) xxx@xxx:~$ python new_test.py f3 22
quamrana 3.2768890857696533
quamrana 3.2760112285614014
(base) xxx@xxx:~$ python new_test.py f3_1 22
eroot163pi 2.4150922298431396
eroot163pi 1.8658664226531982
(base) xxx@xxx:~$ python new_test.py f3 23
quamrana 6.903605937957764
quamrana 7.187666654586792
(base) xxx@xxx:~$ python new_test.py f3_1 23
eroot163pi 4.326314926147461
eroot163pi 3.6970062255859375
(base) xxx@xxx:~$ python new_test.py f3 24
quamrana 14.135217905044556
quamrana 14.102455615997314
(base) xxx@xxx:~$ python new_test.py f3_1 24
eroot163pi 8.097218751907349
eroot163pi 7.514840602874756
(base) xxx@xxx:~$ python new_test.py f3 25
quamrana 29.825793743133545
quamrana 30.300193786621094
(base) xxx@xxx:~$ python new_test.py f3_1 25
eroot163pi 16.243808031082153
eroot163pi 15.114825010299683
(base) xxx@xxx:~$ python new_test.py f3 26
Killed
(base) xxx@xxx:~$ python new_test.py f3_1 26
eroot163pi 35.73880386352539
eroot163pi 34.74332834847338
(base) xxx@xxx:~$ python new_test.py f3_1 27
Killed



Performance of other methods

This is a comparison of some of the above methods on mammoth test cases, since I was curious about the performance myself; quamrana's solution performed really well.

For array lengths greater than 2**26, all of the solution attempts above were killed (due to my own system limitations: 16 GB RAM, 12 cores).

For the rest, quamrana's solution consistently performed best, and Kapil's solution was second best; each solution consistently gave an almost 2-3x speedup over the next.

test.py

import random
import sys
import time

random.seed(31212223)

MAX = 2 ** int(sys.argv[2])
ojb_A = random.sample(range(1, 2**30), MAX)
cnt_A = random.sample(range(1, 2**30), MAX)
ojb_B = random.sample(range(1, 2**30), MAX)
cnt_B = random.sample(range(1, 2**30), MAX)

def f1():
    s1 = time.time()
    import pandas as pd
    df1 = pd.DataFrame(zip(ojb_A,cnt_A))
    df2 = pd.DataFrame(zip(ojb_B,cnt_B))

    df_combined = pd.concat([df1,df2]).groupby(0).sum()
    ojb_C = list(df_combined.index.values)
    cnt_C = list(df_combined[1])
    s2 = time.time()
    print('Kapil', s2 - s1)

def f2():
    s1 = time.time()
    from collections import Counter

    counts = Counter(dict(zip(ojb_A, cnt_A)))
    counts.update(dict(zip(ojb_B, cnt_B)))

    ojb_C, cnt_C = zip(*counts.items())
    s2 = time.time()
    print('fsimonjetz', s2 - s1)

def f3():
    s1 = time.time()
    from collections import defaultdict

    def count(out, ojb, cnt):
        for index,obj in enumerate(ojb):
            out[obj] += cnt[index]
        
    def split_out(out):
        return list(out.keys()), list(out.values())

    out = defaultdict(int)
    count(out, ojb_A, cnt_A)
    count(out, ojb_B, cnt_B)

    ojb_C, cnt_C = split_out(out)
    # print(ojb_C, cnt_C)
    s2 = time.time()
    print('quamrana', s2 - s1)

if __name__ == '__main__':
    eval(f'{sys.argv[1]}()')

Performance on 2^20, 2^21, 2^22

(base) xxx@xxx:~$ python test.py f1 20
Kapil 2.0021448135375977
(base) xxx@xxx:~$ python test.py f2 20
fsimonjetz 2.720785617828369
(base) xxx@xxx:~$ python test.py f3 20
quamrana 0.717628002166748
(base) xxx@xxx:~$ python test.py f1 21
Kapil 4.06165337562561
(base) xxx@xxx:~$ python test.py f2 21
fsimonjetz 6.2198286056518555
(base) xxx@xxx:~$ python test.py f3 21
quamrana 1.563591718673706
(base) xxx@xxx:~$ python test.py f1 22
Kapil 8.361187934875488
(base) xxx@xxx:~$ python test.py f2 22
fsimonjetz 14.354418992996216
(base) xxx@xxx:~$ python test.py f3 22
quamrana 3.355391025543213

eroot163pi