I have a Python anonymisation mechanism that relies on generating fake data from existing attributes.
Those attributes are accessible in the domain D, which is an array of 16 sets, each set holding the possible values of one attribute.
the attributes are ['uid', 'trans_id', 'trans_date', 'trans_type', 'operation', 'amount', 'balance', 'k_symbol', 'bank', 'acct_district_id', 'frequency', 'acct_date', 'disp_type', 'cli_district_id', 'gender', 'zip']
Some attributes have very few values (gender is M or F), some are unique (uid) and can have 1,260,000 different values.
The fake data is generated as tuples of values randomly selected from the domain, one value per attribute.
I have to generate nearly 2 million tuples.
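For illustration, a toy version of D (made-up values, only 3 of the 16 attributes) and the way one fake tuple is drawn from it would look like this:

from random import choice

# Toy stand-in for the real domain D; values are made up for illustration.
D = [
    {"M", "F"},                 # gender
    {"OWNER", "DISPONENT"},     # disp_type
    {100.0, 250.5, 1200.0},     # amount
]

# One fake tuple: one randomly chosen value per attribute.
t = [choice(list(attribute)) for attribute in D]
print(t)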
The first implementation of this was:
from random import choice

def beta_step(I, V, beta, n, m, D):
    r = approx_binomial(m - n, beta)  # approx_binomial is defined elsewhere
    print("r = " + str(r))
    i = 0
    while i < r:
        t = []
        for attribute in D:                # D is a list of 16 sets
            a_j = choice(list(attribute))  # converts the set to a list on every pass
            t.append(a_j)
        if t not in V + I:                 # concatenation + linear scan over ~2M tuples
            V.append(t)
            i += 1
This took around 0.5 s per tuple. Note that I and V are existing lists (initially holding 1,200,000 and 800,000 tuples respectively).
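For context, a shrunken driver for this function might look like the sketch below; the sizes are toy-sized and approx_binomial is stubbed with numpy.random.binomial, which is only an assumption about what the real helper returns:

import numpy as np

def approx_binomial(n_trials, p):
    # Hypothetical stand-in for the real helper, assumed to behave like
    # a binomial draw with n_trials trials and success probability p.
    return int(np.random.binomial(n_trials, p))

# Toy inputs (the real I and V hold ~1,200,000 and ~800,000 tuples,
# and the real D has 16 attribute sets).
D = [set(range(10)) for _ in range(4)]
I = [[0, 1, 2, 3], [4, 5, 6, 7]]
V = [[1, 2, 3, 4]]
beta, n, m = 0.5, len(I), len(I) + 20

beta_step(I, V, beta, n, m, D)   # mutates V in place
print(len(V))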
I already found out that I could speed things up by converting D to a 2D list once and for all, so the sets are not converted to lists on every run:
for attribute in D:
    a_j = choice(attribute)
    t.append(a_j)
This gets me down to 0.2 s per tuple.
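To illustrate where that gain comes from, a rough micro-benchmark of the per-attribute cost (with a made-up 1,000-value attribute) could look like this; the absolute numbers will of course differ from machine to machine:

from random import choice
from timeit import timeit

s = set(range(1000))   # made-up attribute domain
lst = list(s)          # converted once, up front

print(timeit(lambda: choice(list(s)), number=10_000))  # convert inside the loop
print(timeit(lambda: choice(lst), number=10_000))      # reuse the pre-built list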
I also tried looping fewer times and generating multiple tuples at a time like so:
from random import choices

def beta_step(I, V, beta, n, m, D):
    D = [list(attr) for attr in D]  # convert D to a 2D list once and for all
    r = approx_binomial(m - n, beta)
    print("r = " + str(r))
    i = 0
    NT = 1000  # number of tuples generated at a time
    while i < r:
        T = [[] for j in range(NT)]
        for attribute in D:
            a_j = choices(attribute, k=min(NT, r - i))  # one batch of draws per attribute
            for j in range(len(a_j)):
                T[j].append(a_j[j])
        for t in T:
            if t not in V + I:  # still a linear scan per candidate tuple
                V.append(t)
                i += 1
But this takes around 220 s for 1000 tuples, so it is not faster than before.
I have timed the different parts, and it is the last for loop that takes most of the time (around 217 s).
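For what it's worth, the cost seems to be the membership test itself: t not in V + I first builds a new concatenated list (copying roughly 2 million references) and then scans it linearly, once per candidate tuple. A scaled-down sketch of the difference against a set of tuples (made-up data) looks like this:

from timeit import timeit

# Scaled-down stand-ins for V and I (the real ones hold ~2 million tuples combined).
V = [(i, i + 1) for i in range(200_000)]
I = [(i, i + 1) for i in range(200_000, 700_000)]
known = set(V) | set(I)

t = (699_999, 700_000)                               # worst case: match at the very end
print(timeit(lambda: t not in V + I, number=100))    # concatenation + linear scan
print(timeit(lambda: t not in known, number=100))    # average O(1) hash lookup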
Is there any way I could speed things up in order not to run it for 50 hours?
======================= EDIT: I implemented @Larri's suggestion like this:
def beta_step(I, V, beta, n, m, D):
    D = [list(attr) for attr in D]  # convert D to a list of lists
    I = set(tuple(t) for t in I)
    V = set(tuple(t) for t in V)
    r = approx_binomial(m - n, beta)
    print("r = " + str(r))
    i = 0
    print('SIZE I', len(I))
    print('SIZE V', len(V))
    NT = 1000  # number of tuples to generate at each pass
    while i < r:
        T = [[] for j in range(min(NT, r - i))]
        for attribute in D:
            a_j = choices(attribute, k=min(NT, r - i))
            for j in range(len(a_j)):
                T[j].append(a_j[j])
        new_T = set(tuple(t) for t in T) - I  # discard candidates already in I
        size_V_before = len(V)
        V.update(new_T)                       # the set update also drops duplicates of V
        size_V_after = len(V)
        delta_V = size_V_after - size_V_before
        i += delta_V                          # only count tuples actually added
    return [list(t) for t in V]
It now takes almost no time to add elements to V.
In total, adding the 1,680,000 tuples took 91 s. However, converting back to a 2D list takes 200 s. Is there a way to make that conversion faster that doesn't involve rewriting the whole program to work on sets?
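As a point of comparison only (not a verified fix), the final conversion can also be written with map; whether that is meaningfully faster than the list comprehension would have to be measured on the real V:

from timeit import timeit

V = {(i, i + 1, i + 2) for i in range(100_000)}  # made-up stand-in for the real V

print(timeit(lambda: [list(t) for t in V], number=10))  # current conversion
print(timeit(lambda: list(map(list, V)), number=10))    # map-based alternative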