I have a Python anonymisation mechanism that relies on generating fake data from existing attributes.
Those attributes are accessible in the domain D, which is an array of 16 sets, each set holding the possible values of one attribute.
the attributes are ['uid', 'trans_id', 'trans_date', 'trans_type', 'operation', 'amount', 'balance', 'k_symbol', 'bank', 'acct_district_id', 'frequency', 'acct_date', 'disp_type', 'cli_district_id', 'gender', 'zip']
Some attributes have very few values (gender is M or F), some are unique (uid) and can have 1,260,000 different values.
The fake data is generated as tuples of values randomly selected from the domain, one value per attribute.
I have to generate nearly 2 million tuples.
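For illustration, a toy version of D (made-up values, only 3 of the 16 attributes) and the way one fake tuple is drawn from it would look like this:

from random import choice

# Toy stand-in for the real domain D; values are made up for illustration.
D = [
    {"M", "F"},                 # gender
    {"OWNER", "DISPONENT"},     # disp_type
    {100.0, 250.5, 1200.0},     # amount
]

# One fake tuple: one randomly chosen value per attribute.
t = [choice(list(attribute)) for attribute in D]
print(t)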
The first implementation of this was:
from random import choice

def beta_step(I, V, beta, n, m, D):
    r = approx_binomial(m - n, beta)  # approx_binomial is defined elsewhere
    print("r = " + str(r))
    i = 0
    while i < r:
        t = []
        for attribute in D:                # D is a list of 16 sets
            a_j = choice(list(attribute))  # converts the set to a list on every pass
            t.append(a_j)
        if t not in V + I:                 # concatenation + linear scan over ~2M tuples
            V.append(t)
            i += 1
This took around 0.5 s per tuple. Note that I and V are existing lists (initially holding 1,200,000 and 800,000 tuples respectively).
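For context, a shrunken driver for this function might look like the sketch below; the sizes are toy-sized and approx_binomial is stubbed with numpy.random.binomial, which is only an assumption about what the real helper returns:

import numpy as np

def approx_binomial(n_trials, p):
    # Hypothetical stand-in for the real helper, assumed to behave like
    # a binomial draw with n_trials trials and success probability p.
    return int(np.random.binomial(n_trials, p))

# Toy inputs (the real I and V hold ~1,200,000 and ~800,000 tuples,
# and the real D has 16 attribute sets).
D = [set(range(10)) for _ in range(4)]
I = [[0, 1, 2, 3], [4, 5, 6, 7]]
V = [[1, 2, 3, 4]]
beta, n, m = 0.5, len(I), len(I) + 20

beta_step(I, V, beta, n, m, D)   # mutates V in place
print(len(V))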
I already found out that I could speed things up by converting D to a 2D list once and for all, so the sets are not converted to lists on every run:
for attribute in D:
    a_j = choice(attribute)
    t.append(a_j)
This gets me down to 0.2 s per tuple.
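To illustrate where that gain comes from, a rough micro-benchmark of the per-attribute cost (with a made-up 1,000-value attribute) could look like this; the absolute numbers will of course differ from machine to machine:

from random import choice
from timeit import timeit

s = set(range(1000))   # made-up attribute domain
lst = list(s)          # converted once, up front

print(timeit(lambda: choice(list(s)), number=10_000))  # convert inside the loop
print(timeit(lambda: choice(lst), number=10_000))      # reuse the pre-built list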
I also tried looping fewer times and generating multiple tuples at a time like so:
from random import choices

def beta_step(I, V, beta, n, m, D):
    D = [list(attr) for attr in D]  # convert D to a 2D list once and for all
    r = approx_binomial(m - n, beta)
    print("r = " + str(r))
    i = 0
    NT = 1000  # number of tuples generated at a time
    while i < r:
        T = [[] for j in range(NT)]
        for attribute in D:
            a_j = choices(attribute, k=min(NT, r - i))  # one batch of draws per attribute
            for j in range(len(a_j)):
                T[j].append(a_j[j])
        for t in T:
            if t not in V + I:  # still a linear scan per candidate tuple
                V.append(t)
                i += 1
But this takes around 220 s for 1000 tuples, so it is not faster than before.
I have timed the different parts, and it is the last for loop that takes most of the time (around 217 s).
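For what it's worth, the cost seems to be the membership test itself: t not in V + I first builds a new concatenated list (copying roughly 2 million references) and then scans it linearly, once per candidate tuple. A scaled-down sketch of the difference against a set of tuples (made-up data) looks like this:

from timeit import timeit

# Scaled-down stand-ins for V and I (the real ones hold ~2 million tuples combined).
V = [(i, i + 1) for i in range(200_000)]
I = [(i, i + 1) for i in range(200_000, 700_000)]
known = set(V) | set(I)

t = (699_999, 700_000)                               # worst case: match at the very end
print(timeit(lambda: t not in V + I, number=100))    # concatenation + linear scan
print(timeit(lambda: t not in known, number=100))    # average O(1) hash lookup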
Is there any way I could speed things up in order not to run it for 50 hours?
======================= EDIT: I implemented @Larri's suggestion like this:
def beta_step(I, V, beta, n, m, D):
    D = [list(attr) for attr in D]  # convert D to a list of lists
    I = set(tuple(t) for t in I)
    V = set(tuple(t) for t in V)
    r = approx_binomial(m - n, beta)
    print("r = " + str(r))
    i = 0
    print('SIZE I', len(I))
    print('SIZE V', len(V))
    NT = 1000  # number of tuples to generate at each pass
    while i < r:
        T = [[] for j in range(min(NT, r - i))]
        for attribute in D:
            a_j = choices(attribute, k=min(NT, r - i))
            for j in range(len(a_j)):
                T[j].append(a_j[j])
        new_T = set(tuple(t) for t in T) - I  # discard candidates already in I
        size_V_before = len(V)
        V.update(new_T)                       # the set update also drops duplicates of V
        size_V_after = len(V)
        delta_V = size_V_after - size_V_before
        i += delta_V                          # only count tuples actually added
    return [list(t) for t in V]
It now takes almost no time to add elements to V.
In total, adding the 1,680,000 tuples took 91 s. However, converting back to a 2D list takes 200 s. Is there a way to make that conversion faster that doesn't involve rewriting the whole program to work on sets?
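As a point of comparison only (not a verified fix), the final conversion can also be written with map; whether that is meaningfully faster than the list comprehension would have to be measured on the real V:

from timeit import timeit

V = {(i, i + 1, i + 2) for i in range(100_000)}  # made-up stand-in for the real V

print(timeit(lambda: [list(t) for t in V], number=10))  # current conversion
print(timeit(lambda: list(map(list, V)), number=10))    # map-based alternative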