0

I have a list of dictionaries that contain bacterial name as keys, and as values a set of numbers identifying a DNA sequence. Unluckily, in some dictionaries there is a missing value, and the script fails to produce the csv. Can anyone give me an idea on how I can get around it? This is my script:

import glob, subprocess, sys, os, csv
from Bio import SeqIO, SearchIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def allele():
    folders=sorted(glob.glob('path_to_files'))
    dict_list=[]
    for folder in folders:
        fasta_file=glob.glob(folder +'/file.fa')[0]
        input_handle=open(fasta_file ,'r')
        records=list(SeqIO.parse(input_handle, 'fasta'))
        namelist=[]
        record_dict={}
        sampleID = os.path.basename(folder)
        record_dict['sampleid']=sampleID
        for record in records:
            name=record.description.split('\t')
            gene=record.id.split('_')
            geneID=gene[0] + '_' +gene[1]
            allele=gene[2]      
            record_dict[geneID]=allele
        dict_list.append(record_dict)
    header = dict_list[0].keys()
    with open('path_to_files/mycsv.csv', 'w') as csv_output:
        writer=csv.DictWriter(csv_output,header,delimiter='\t')
        writer.writeheader()
        for samp in dict_list:
            writer.writerow(samp)


print 'start'           
allele()

Also can I get any suggestion on how to identify those dictionaries whose values sequence are the same? Thanks

Ana
  • 131
  • 1
  • 14
  • By "missing value", do you mean a entire {key:value} combination is missing, i.e., the dicts are shorter? For the second question, do you think about a somewhat inverse dictionary, e.g., the DNA sequence as key and then a list of bacteria that have that sequence as value? – mikuszefski Nov 23 '17 at 15:26
  • Note, solutions might depend on the size of your data, do we talk about a few thousand or more about billions? – mikuszefski Nov 23 '17 at 15:28
  • @mikuszefski-just a in total is about 944 genes so I guess that a few thousand. The missing value means that the dict values are less ie. if every dict has 100 values, one have 99 because the record.seq is missing. The dictionaries have this structure [{keys: 1, 2, 3, 5}, keys: 3, 5.6,12}]. They keys are the bacterial strain, and the numbers represent the genes to which I gave a number as id. – Ana Nov 23 '17 at 19:32
  • With your second question do you mean the dicts are identical except for the sampleid? I.e. you are looking for dicts that the same values for all keys? – mikuszefski Nov 24 '17 at 08:33
  • @mikuszefski- yes, the keys are the bacterial strains and the values the sequence of genes present in each bacterial strain. The same values and the same order. – Ana Nov 24 '17 at 09:52

1 Answers1

0

Concerning your first question, I'd just get the shorter dict's and fill, the missing entry with something that your dictWriter works with (didn't use it, ever), I guess NaN may work.

The simple thing would look like

testDict1 = {   "sampleid" : 0,
                "bac_s1" : [1,2,4],
                "bac_s2" : [1,2,12], 
                "bac_s3" : [1,3,12], 
                "bac_s4" : [1,6,12], 
                "bac_s5" : [1,9,14]
}


testDict2 = {   "sampleid" : 1,
                "bac_s1" : [1,2,4],
                "bac_s2" : [1,3,12], 
                "bac_s3" : [1,3,12], 
                "bac_s5" : [2,9,14], 
}


testDict3 = {   "sampleid" : 2,
                "bac_s1" : [3,2,4],
                "bac_s2" : [4,2,12], 
                "bac_s3" : [5,3,12], 
                "bac_s4" : [1,6,12], 
                "bac_s5" : [1,9,14]
}

dictList = [ testdict1, testdict2, testdict3 ]



### modified from https://stackoverflow.com/a/16974075/803359
### note, does not tell you elements that are present in shortdict but missing in long dict
### i.e. for your purpose you have to assume that keys in short dict are really present in longdict 
def missing_elements( longdict, shortdict ):
    list1 = longdict.keys()
    list2 = shortdict.keys()
    assert len( list1 ) >= len( list2 )
    return sorted( set( list1 ).difference( list2 ) )


### make the first dict the longest
dictList = sorted( dictList, key=len, reverse=True )

for myDict in dictList[1:]:
    ### compare all others to the first
    missing = missing_elements( dictList[0], myDict )
    print missing
    ### then you fill with something empty or NaN that works with your save-function
    for m in missing:
        myDict[m] = [ float( 'nan' ) ]

print " "      
for myDict in dictList:
    print myDict
    print " " 

which you can incorporate in your code easily.

mikuszefski
  • 3,943
  • 1
  • 25
  • 38