I'm making a python simulation for kinship analysis. The input is EIGENSTRAT .geno file, contains 0 and 2 numbers, 1.2 million per sample. I want to simulate "bad" positions, so randomly (according to a variable) put position to 9, what is "missing snp" in eigenstrat.
My problem, that somehow it's really slow. For 1 sample, with one probability factor it takes ~1900 sec, but I want to make around 1500 simulations. This code is for debugging mode. How can I make it faster? I tried using numba, but it only saves me ~100 sec.
possib = [0.95]
geno_arr = np.array([[2,0,2,0,2,0,2,0,0,2,0,0,2], [0,0,2,0,0,0,2,0,0,0,2,0,2]])
sample_list=['HG00244.SG', 'HG00356.SG']
t1_start = process_time()
for p in possib:
sample1 = geno_arr[0]
for i in range(1,len(sample_list)):
sample2 = geno_arr[1]
max_valid_marker=round(len(sample1)*p)
current_valid_marker_number=max_valid_marker
while current_valid_marker_number!=0:
sample_choice=np.array([0,1])
valami=np.random.choice(sample_choice, 1, replace=False) #random.choice(sample_choice)
pos_choice=np.arange(0, len(sample1))
pos_choice_random= np.random.choice(pos_choice, 1, replace=False)#random.choice(pos_choice)
if valami==0:
if sample1[pos_choice_random] != 9 and sample2[pos_choice_random] != 9:
sample1[pos_choice_random] = 9
current_valid_marker_number -= 1
else:
if sample1[pos_choice_random] != 9 and sample2[pos_choice_random] != 9:
sample2[pos_choice_random] = 9
current_valid_marker_number -= 1
t1_stop = process_time()
print("Full finish: ", t1_stop - t1_start)