
I have millions of DNA clone reads, and a few of them are misreads or errors. I want to keep only the clean reads.

For those without a biology background:

A DNA clone consists of only four characters (A, T, C, G) in various permutations/combinations. Any character, symbol, or sign other than "A", "T", "C", and "G" in a DNA read is an error. Is there a fast/high-throughput way in Python to separate out the clean reads?

Basically, I want a way to identify strings that contain nothing but the alphabet characters "A", "T", "C", and "G".

Edit
correct_read_clone: "ATCGGTTCATCGAATCCGGGACTACGTAGCA"

misread_clone: "ATCGGNATCGACGTACGTACGTTTAAAGCAGG" or "ATCGGTT@CATCGAATCCGGGACTACGTAGCA" or "ATCGGTTCATCGAA*TCCGGGACTACGTAGCA" or "AT?CGGTTCATCGAATCCGGGACTACGTAGCA" etc

I have tried the for loop below:

check_list=['A','T','C','G']
for i in clone:
    if i not in check_list:
        continue

The problem with this for loop is that it iterates over the string and matches characters one by one, which makes the process slow. For cleaning millions of clones, this delay is very significant.
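(Note that, as written, the loop above never actually flags anything: `continue` simply moves on to the next character. A corrected version of the same per-character idea, wrapped in a function whose name is mine, would be:)

```python
def is_clean(clone):
    """Return True if the read contains only A, T, C, G."""
    for ch in clone:
        if ch not in "ATCG":
            return False
    return True

print(is_clean("ATCGGTTCATCGAATCCGGGACTACGTAGCA"))   # True  (clean read)
print(is_clean("ATCGGTT@CATCGAATCCGGGACTACGTAGCA"))  # False (misread)
```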

shivam
  • are the misreads more frequent at the end of your reads or distributed equally along the read lengths? Not sure, but https://stackoverflow.com/a/75393644/9877065 should be faster starting from the end of the sequence than using set()? – pippo1980 Feb 09 '23 at 14:05
    How did your sequence end up with not only N characters but also other non alphanumeric characters like '@' or '*'? This looks like an abuse of FASTQ/FASTA file formats and no standard bioinformatic workflow would require the parsing or removal of these sequences (other than perhaps those with Ns under certain circumstances) – Chris_Rands Feb 10 '23 at 17:11

5 Answers


If these are the nucleotide sequences, with an error in 2 of them:

a = 'ATACTGAGTCAGTACGTACTGAGTCAGTACGT'
b = 'AACTGAGTCAGTACGTACTGAGTCAAGTCAGTACGTSACTGAGTCAGTACGT'
c = 'ATUACTGAGTCAGTACGT'
d = 'AAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
e = 'AACTGAGTCAGTAAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
f = 'AAGTACGTACTGAGTCAGTACGTACTCAGTACGT'
g = 'ATCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
test = a, b, c, d, e, f, g

Try:

misread_counter = 0
correct_read_clone = []
for clone in test:
    if len(set(clone)) <= 4:
        correct_read_clone.append(clone)
    else:
        misread_counter +=1

print(f'Unclean sequences: {misread_counter}')
print(correct_read_clone)

Output:

Unclean sequences: 2
['ATACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AACTGAGTCAGTAAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AAGTACGTACTGAGTCAGTACGTACTCAGTACGT', 'ATCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT']

This way the for loop only has to visit each full sequence in the list of clones, rather than looping over every character of every sequence.

Or, if you want to know which ones have the errors, you can build two lists:

misread_clone = []
correct_read_clone = []
for clone in test:
    if len(set(clone)) > 4:
        misread_clone.append(clone)
    else:
        correct_read_clone.append(clone)

print(f'misread sequences count: {len(misread_clone)}')
print(correct_read_clone)

Output:

misread sequences count: 2
['ATACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AACTGAGTCAGTAAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AAGTACGTACTGAGTCAGTACGTACTCAGTACGT', 'ATCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT']
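Be aware that the set-size test alone can pass a read like "AAC!" (only three distinct characters, yet unclean). A variant that keeps the single set operation but checks membership exactly, using `set.issubset` (my addition, not part of the answer above), would be:

```python
VALID = set("ACGT")

def is_clean(clone):
    # build the character set once, then check it is contained in {A, C, G, T};
    # the subset comparison touches at most a handful of elements
    return set(clone) <= VALID

reads = ["ATACTGAGTCAGTACGT", "AAC!", "ATUACTGAGTCAGT"]
print([r for r in reads if is_clean(r)])  # ['ATACTGAGTCAGT ACGT'-style read only]
```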

I don't think you're going to get many significant improvements for this. Most operations on a string are going to be O(N), and there isn't much you can do to get that down to O(log N) or O(1). The check against ACTG is also linear, leading to a worst case of O(N*M), where N and M are the lengths of the string and of ACTG.

One thing you could do is cast the string into a set, which is O(N), and check whether the length of the set is more than 4 (which is impossible if the only characters are ACTG). If not, loop through the set and do the check against ACTG. I am assuming a clone could be a string such as "AACCAA!!", which results in the set {'A', 'C', '!'}: the length is less than or equal to 4, but the read is still unclean/incorrect.

clones = [ "ACGTATCG", "AGCTGACGAT", "AGTACGATCAGT", "ACTGAGTCAGTACGT", "AGTACGTACGATCAGTACGT", "AAACCS", "AAACCCCCGGGGTTTS"]
for clone in clones:
    if len(set(clone)) > 4:
        print(f"unclean: {clone}")
    else:
        for char in clone:
            if char not in "ACTG":
                print(f"unclean: {clone}")
                break
        else:
            print(f"clean: {clone}")

Since len() on a set is O(1), that can potentially skip the need to check against ACTG. If the length is less than or equal to 4, the check is O(n*m) again, but in this case n is guaranteed to be at most 4 while your m stays the same at 4. The final process becomes O(n) rather than O(n*m), where n and m are the lengths of the set and of ACTG. Since you are now checking against a set, and anything other than ACTG is unclean, n has a cap of 4. This means that no matter how large the original string is, doing the ACTG check on the set is worst case O(4*4) and thus essentially O(1) (Big O notation is about scaling rather than exact values).

However, whether this is actually faster depends on the length of the original string. It may end up taking more time if the original string is short, though that is unlikely, since the string would have to be very short.

You may save more time by tackling the number of entries, which you have noted is very large: if possible, consider splitting the input into smaller groups and running them asynchronously. At the end of the day, though, none of these will change the asymptotic scaling. They reduce the time taken by cutting out a constant factor or by running a few checks at the same time, but the process is still O(N*M), with N and M being the number and length of the strings, and there isn't anything that can really change that.
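A minimal sketch of the splitting idea, using the standard library's `concurrent.futures` (the function and parameter names here are mine, not from the question):

```python
from concurrent.futures import ProcessPoolExecutor

VALID = set("ACTG")

def filter_chunk(chunk):
    # the per-read check is unchanged; only the data is split up
    return [clone for clone in chunk if set(clone) <= VALID]

def clean_reads(clones, workers=4, chunk_size=100_000):
    # slice the big list into chunks and filter them in parallel
    chunks = [clones[i:i + chunk_size]
              for i in range(0, len(clones), chunk_size)]
    clean = []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # pool.map returns each chunk's result in input order
        for result in pool.map(filter_chunk, chunks):
            clean.extend(result)
    return clean
```

On platforms that spawn worker processes (Windows, macOS), the `clean_reads` call must sit under an `if __name__ == '__main__':` guard.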

Shorn
  • what about multithreading processing of sort ?? – pippo1980 Feb 09 '23 at 14:09
  • @pippo1980 I'm not quite sure what you mean since none of the suggested solutions here, or the original problem, are using sort as far as I am aware. – Shorn Feb 10 '23 at 00:52
  • Hi , sorry bad translation I didnt mean of sort(). Was like what about multithreading/multiprocessing or sone related library to speed up the task – pippo1980 Feb 10 '23 at 09:24
  • Ah I get what you mean now. For future reference, the phrase is "of sorts". As for multithreading, I'm not familiar enough to say exactly which of the sections can be safely multithreaded. However, the largest time complexity is O(N) and is casting the string to set. I don't think it can be multithreaded to any significant benefit unless the strings are consistently extremely long. As such, it's probably better to consider the larger source of time which is the large original set. By splitting it and using some form of async, including threading, that's more likely to provide more time saved. – Shorn Feb 10 '23 at 12:25
  • Dont know if I am following right but if I can run set() on 4 strings using a processpool I could save 75% of the time ??? O(N)/4 ?? – pippo1980 Feb 10 '23 at 13:17
  • hi @Shorn could you have a peek at my answer below https://stackoverflow.com/a/75413855/9877065 and see if it is meaningful in respect to your above comment, Big O Notation is kind of new to me, cheers – pippo1980 Feb 10 '23 at 16:45
  • Not sure about how sorted() function works, but it has increased the screening speed significantly. Please see the below. – shivam Mar 15 '23 at 08:40

try this:

def is_clean_read(read):
    for char in read:
        if char not in ['A', 'T', 'C', 'G']:
            return False
    return True

reads = [ "ACGTATCG", "AGCTGACGAT", "AGTACGATCAGT", "ACTGAGTCAGTACGT", "AGTACGTACGATCAGTACGT"]

clean_reads = [read for read in reads if is_clean_read(read)]

print(clean_reads)
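A drop-in variant of the same membership test (my suggestion, not part of the answer) pushes the character scan into C code via `str.strip`: stripping "ATCG" from both ends leaves an empty string exactly when the read contains nothing else.

```python
def is_clean_read(read):
    # str.strip removes leading/trailing A, T, C, G characters; anything
    # left over means some other character was present in the read
    return read.strip("ATCG") == ""

reads = ["ACGTATCG", "AGCTGACGAT", "ACGT@TCG"]
print([read for read in reads if is_clean_read(read)])  # ['ACGTATCG', 'AGCTGACGAT']
```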
jared_mamrot
Olie
Olie

OK, stealing from the answer https://stackoverflow.com/a/75393987/9877065 by Shorn, I tried to add multiprocessing. You can play with the length of my orfs list in the first part of the code, then try changing number_of_processes = XXX to different values, from 1 up to your system max, multiprocessing.cpu_count(). Code:

import time

from multiprocessing import Pool

from datetime import datetime


a = 'ATACTGAGTCAGTACGTACTGAGTCAGTACGT'
b = 'AACTGAGTCAGTACGTACTGAGTCAAGTCAGTACGTSACTGAGTCAGTACGT'
c = 'ATUACTGAGTCAGTACGT'
d = 'AAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
e = 'AACTGAGTCAGTAAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
f = 'AAGTACGTACTGAGTCAGTACGTACTCAGTACGT'
g = 'ATCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
aa = 'ATACTGAGTCAGTACGTACTGAGTCAGTACGT'
bb = 'AACTGAGTCAGTACGTACTGAGTCAAGTCAGTACGTSACTGAGTCAGTACGT'
cc = 'ATUACTGAGTCAGTACGT'
dd = 'AAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
ee = 'AACTGAGTCAGTAAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
ff = 'AAGTACGTACTGAGTCAGTACGTACTCAGTACGT'
gg = 'ATCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
aaa = 'ATACTGAGTCAGTACGTACTGAGTCAGTACGT'
bbb = 'AACTGAGTCAGTACGTACTGAGTCAAGTCAGTACGTSACTGAGTCAGTACGT'
ccc = 'ATUACTGAGTCAGTACGT'
ddd = 'AAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
eee = 'AACTGAGTCAGTAAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
fff = 'AAGTACGTACTGAGTCAGTACGTACTCAGTACGT'
ggg = 'ATCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
kkk = 'AAAAAAAAAAAAAAAAAAAAAAkkkkkkkkkkkkk'

clones = [a, b, c, d, e, f, g, aa, bb, cc, dd, ee, ff,gg, aaa, bbb, ccc, ddd, eee, fff, ggg, kkk]

# enlarge the test list for benchmarking; the repeated self-extends in the
# original code doubled it six times, i.e. 64 copies of the 22 clones
clones_2 = clones * 64

# print(clones_2, len(clones_2))

def check(clone):
    
    # WARNING: sleeping here inflates CPU time vs I/O time
    # time.sleep(1)
    if len(set(clone)) > 4:
        print(f"unclean: {clone}")
    else:
        for char in clone:
            if char not in "ACTG":
                print(f"unclean: {clone}")
                break
        else:
            print(f"clean: {clone}")
            
            
# guard needed so multiprocessing also works on platforms that spawn workers
if __name__ == '__main__':

    begin = datetime.now()

    number_of_processes = 4

    p = Pool(number_of_processes)

    list_a = []

    # feed the pool in batches of number_of_processes clones
    for clone in clones_2:
        list_a.append(clone)
        if len(list_a) == number_of_processes:
            result = p.map(check, list_a)
            list_a = []

    # process any clones left over from the last, partial batch
    if list_a:
        result = p.map(check, list_a)

    p.close()
    p.join()

    print('done')
    print(datetime.now() - begin)

I have to preload a list containing the orfs to be multiprocessed at each iteration. Despite that, it can at least cut the execution time in half on my machine. I am not sure how stdout influences the speed of the multiprocessing (or how to cope with result order; see python multiprocess.Pool show results in order in stdout).
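On the ordering point: `Pool.map` already returns its results in input order, so one way around relying on stdout is to have the worker return its verdict instead of printing it. A sketch (function names are mine):

```python
from multiprocessing import Pool

VALID = set("ACTG")

def check(clone):
    # return a (clone, is_clean) pair instead of printing the verdict
    return clone, set(clone) <= VALID

def classify(clones, processes=4):
    with Pool(processes) as p:
        # Pool.map blocks until all workers finish and preserves input order
        return p.map(check, clones)
```

As with any `multiprocessing` code, the `classify` call should live under an `if __name__ == '__main__':` guard on platforms that spawn workers.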

pippo1980
    Regarding the big O notation, the part which includes my answer stays the same, but the inclusion of multiprocessing will indeed meaningfully cut down the required time. Since big O notation denotes more of the time scaling with change to size, it doesn't change between the 2 answers. However, I would say it's a valid improvement and will definitely improve the runtime. The only considerations now are order, and any relevant overhead required to enable the multithreading as well as saving the data in a way that is usable by the user. – Shorn Feb 11 '23 at 15:58

Here I build the sorted set of characters from each clone sequence and compare it against the full DNA base set ['A','C','G','T']; any other result means a misread.

clones = [ "ACCCCGGGAAAAG", "ACCAACCTTTT", "AGTACGATCAGT", "ACTGAGTCAGTACGT", "AGTACGTACGATCAGTACGT", "AAACCS", "AAACCCCCGGGGTTTS"]
for clone in clones:
    
    clone_set = sorted(set(clone))
    # print(clone_set)

    if clone_set != ['A', 'C', 'G', 'T']:
        print(f"unclean: {clone}")
    else:
        print(f"clean: {clone}")
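For comparison (not from the answer), the whole check can also be done with one `re.fullmatch` call per read. Unlike the list-equality test, this also accepts a clean read that happens to use fewer than four distinct bases (e.g. "AAAA"):

```python
import re

# one compiled pattern, reused for every read
CLEAN = re.compile(r"[ACGT]+")

def is_clean(read):
    # fullmatch succeeds only if the entire read is drawn from A, C, G, T
    return CLEAN.fullmatch(read) is not None

clones = ["ACCCCGGGAAAAG", "AAACCS", "AGTACGATCAGT", "AAAA"]
print([c for c in clones if is_clean(c)])  # ['ACCCCGGGAAAAG', 'AGTACGATCAGT', 'AAAA']
```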
shivam
  • I also have no idea how sorted manages to improve the runtime, could be something under the hood that I am not aware of maybe related to the underlying hashtable of a set or how list comparisons work, but if it works for you that's great. I'm curious, since it's outside my field, but is it not possible to have the set be something like `['A', 'C', 'T']`, asking since there are some permutations not in your check for unclean. – Shorn Mar 16 '23 at 01:16
  • @shorn may be by this python is skipping one additional for loop in else statement (to check the length and match the items). Generally DNA read length is around 400 so it is impossible not to have any of the characters unless DNA is purposely generated and if so it is also a misread. therefore (if clone_set!=['A','C','G','T']) only suffice to check. Don't know why did I put the additional two checks... HaHaHa :P I have corrected it. – shivam Mar 16 '23 at 02:06
  • from a quick reading, the time complexity of comparing lists is O(n) if they are the same length and O(1) otherwise. That way your if statement is O(1) if unclean and O(n) if clean. The `sorted` function is usually O(n log n). But a major aspect is that in both our cases, n is a constant between 0 and 4 (if it's > 4 the resulting functions become O(1) anyway). So both ways it's essentially O(1). Some O(1)s are better than others, but that is a level of depth I have not ventured to. It likely only made a difference since you have to run it millions of times though. I've no idea. – Shorn Mar 16 '23 at 02:13