0

I'm making a python simulation for kinship analysis. The input is EIGENSTRAT .geno file, contains 0 and 2 numbers, 1.2 million per sample. I want to simulate "bad" positions, so randomly (according to a variable) put position to 9, what is "missing snp" in eigenstrat.

My problem, that somehow it's really slow. For 1 sample, with one probability factor it takes ~1900 sec, but I want to make around 1500 simulations. This code is for debugging mode. How can I make it faster? I tried using numba, but it only saves me ~100 sec.

possib = [0.95]
geno_arr = np.array([[2,0,2,0,2,0,2,0,0,2,0,0,2], [0,0,2,0,0,0,2,0,0,0,2,0,2]])
sample_list=['HG00244.SG', 'HG00356.SG']
t1_start = process_time()

for p in possib:
    sample1 = geno_arr[0]
    for i in range(1,len(sample_list)):
        sample2 = geno_arr[1]
        max_valid_marker=round(len(sample1)*p)
        current_valid_marker_number=max_valid_marker
        while current_valid_marker_number!=0:
            sample_choice=np.array([0,1])
            valami=np.random.choice(sample_choice, 1, replace=False) #random.choice(sample_choice)
            pos_choice=np.arange(0, len(sample1))
            pos_choice_random= np.random.choice(pos_choice, 1, replace=False)#random.choice(pos_choice)
            if valami==0:
                if sample1[pos_choice_random] != 9 and sample2[pos_choice_random] != 9:
                    sample1[pos_choice_random] = 9
                    current_valid_marker_number -= 1
            else:
                if sample1[pos_choice_random] != 9 and sample2[pos_choice_random] != 9:
                    sample2[pos_choice_random] = 9
                    current_valid_marker_number -= 1

t1_stop = process_time()
print("Full finish: ", t1_stop - t1_start)
martineau
  • 119,623
  • 25
  • 170
  • 301
  • 5
    I suggest [profiling](https://docs.python.org/3/library/profile.html) your code first to find what does consume most time. – Daweo May 13 '21 at 09:52
  • 1
    Which array is the 1.2 million items array? `geno_arr`? – 9769953 May 13 '21 at 09:55
  • Can you post a longer sample so we can test how slow it is, and maybe find an improvement? – Azuuu May 13 '21 at 10:01
  • 1
    See [How can you profile a Python script?](https://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script) – martineau May 13 '21 at 10:36
  • Thank you for the profiling links! Yes, geno_arr is the big one. You can add any 0 or 2 to the array, it's just a random arrany, not the "normal" dataset – Emil Nyerki May 13 '21 at 10:52

1 Answers1

1

You want to set with a probability of p one element in each column of a array to 9:

geno_arr = np.array([[2,0,2,0,2,0,2,0,0,2,0,0,2], [0,0,2,0,0,0,2,0,0,0,2,0,2]])
max_valid_marker = round(geno_arr.shape[1]*p)
index1 = np.random.choice(np.arange(geno_arr.shape[0]), max_valid_marker, replace=True)
index2 = np.random.choice(np.arange(geno_arr.shape[1]), max_valid_marker, replace=False)
geno_arr[index1,index2] = 9
Daniel
  • 42,087
  • 4
  • 55
  • 81