This is an exercise in nested dictionaries (or an array/list of dictionaries).
The data structure is best described as an array of structs/classes in C/C++, where each struct has multiple members. The challenges for me are:
1). The string "Sample Name" acts as the separator at the beginning of each record and is followed by multiple members;
2). Each record has 6 members, one per row, with the name and value separated by a colon ":";
3). How to read multiple lines (instead of multiple fields on the same line, which is easier to parse) into the members of one record;
4). The record separator may not be preceded by a blank line.
I have included sample input and the expected output for testing.
Example: input.txt
Sample Name: CanNAM1_192
SNPs : 5392
MNPs : 0
Insertions : 248
Deletions : 359
Phased Genotypes : 8.8% (2349/26565)
MNP Het/Hom ratio : - (0/0)
Sample Name: CanNAM2_195
SNPs : 5107
MNPs : 0
Insertions : 224
Deletions : 351
Phased Genotypes : 8.9% (2375/26560)
MNP Het/Hom ratio : - (0/0)
Sample Name: CanNAM3_196
SNPs : 4926
MNPs : 0
Insertions : 202
Deletions : 332
Phased Genotypes : 8.0% (2138/26582)
MNP Het/Hom ratio : - (0/0)
In awk, the record separator RS and the field separator FS can be set at the beginning, but to my knowledge Python has no such built-in feature.
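As a rough illustration of the RS/FS idea in Python, the whole text can be split on the record separator first and each line then split on the first colon. This is only a minimal sketch: the function name `split_records` is made up for illustration, and it assumes the string "Sample Name" never appears inside a value.

```python
def split_records(text, record_sep="Sample Name"):
    """Split text into records on record_sep (roughly awk's RS), then split
    each line of a record on its first colon (roughly awk's FS)."""
    records = []
    # str.split drops the separator itself; element 0 is whatever
    # precedes the first record, so it is skipped.
    for chunk in text.split(record_sep)[1:]:
        fields = []
        for line in chunk.splitlines():
            if ":" not in line:
                continue  # skip blank lines
            key, value = line.split(":", 1)
            # The first line of a chunk lost its key to the split; restore it.
            fields.append((key.strip() or record_sep, value.strip()))
        records.append(fields)
    return records


sample = "Sample Name: A1\nSNPs : 5\nMNPs : 0\nSample Name: B2\nSNPs : 7\nMNPs : 1\n"
for record in split_records(sample):
    print(record)
```

Because the split happens on the separator string rather than on blank lines, it does not matter whether a record is preceded by a blank line.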
Output.tab:
CanNAM1_192 5392 0 248 359 8.8% -
CanNAM2_195 5107 0 224 351 8.9% -
CanNAM3_196 4926 0 202 332 8.0% -
I searched for example code for my case (like this one and this one) and tried the following:
import sys

filename = sys.argv[1]
Dictn = {}
with open(filename, 'r') as fh:
    for line in fh:
        while True:
            if line.startswith('Sample Name'):
                nameLine = line.strip()
                ID = nameLine.split(':')
            else:
                line2 = next(fh).strip()
                line2 = line2.split(':')
                print(line2[0], line2[1])  # For debugging to see the parsing result
                line3 = next(fh).strip().split(':')
                line4 = next(fh).strip().split(':')
                line5 = next(fh).strip().split(':')
                line6 = next(fh).strip().split(':')
                line7 = next(fh).strip().split(':')
                Dictn.update({
                    ID[1]: {
                        line2[0]: line2[1],
                        line3[0]: line3[1],
                        line4[0]: line4[1],
                        line5[0]: line5[1],
                        line6[0]: line6[1],
                        line7[0]: line7[1],
                    }
                })
            break
print(Dictn)
Dictn.get('CanNAM1_192')
# Expected result:
# {'CanNAM1_192': {'SNPs': '5392', 'MNPs': '0', 'Insertions': '248', 'Deletions': '359', 'Phased Genotypes': '8.8%', 'MNP Het/Hom ratio': '-'}}
I am stuck on parsing each record by reading 7 lines at a time and then pushing/updating the record into the dictionary. I am not good at Python, and I would really appreciate any help!
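One possible direction, given only as a minimal sketch under assumptions: instead of pulling a fixed 7 lines with `next()`, read one line at a time and start a new inner dictionary whenever a "Sample Name" line appears. The helper names `parse_records` and `to_rows` are hypothetical, and the sketch assumes every data line has the form `key : value` and that only the first whitespace-separated token of each value is wanted in the output.

```python
def parse_records(lines, record_key="Sample Name"):
    """Build {sample_name: {field: value}} from 'key : value' lines."""
    records = {}
    current = None  # inner dict of the record currently being filled
    for line in lines:
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or line without a colon
        key, value = key.strip(), value.strip()
        if key == record_key:
            # A new record starts here, blank line before it or not.
            current = records.setdefault(value, {})
        elif current is not None:
            current[key] = value
    return records


def to_rows(records):
    """One row per sample: the name, then the first token of each value."""
    return [[name] + [(v.split() or [""])[0] for v in fields.values()]
            for name, fields in records.items()]


sample = """Sample Name: CanNAM1_192
SNPs : 5392
MNPs : 0
Insertions : 248
Deletions : 359
Phased Genotypes : 8.8% (2349/26565)
MNP Het/Hom ratio : - (0/0)
Sample Name: CanNAM2_195
SNPs : 5107
MNPs : 0
Insertions : 224
Deletions : 351
Phased Genotypes : 8.9% (2375/26560)
MNP Het/Hom ratio : - (0/0)
"""

records = parse_records(sample.splitlines())
for row in to_rows(records):
    print("\t".join(row))
```

Since `parse_records` only iterates over lines, an open file handle (for example from `open(sys.argv[1])`) could be passed in place of `sample.splitlines()`. The inner dictionaries keep the full values such as `8.8% (2349/26565)`; trimming to the first token happens only when the rows are built.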