
This is an exercise in nested dictionaries (or an array/list of dictionaries). The data structure is best described as an array of struct/class in C/C++, where each struct has multiple members. The challenges for me:
1). The string "Sample Name" acts as a separator at the beginning of each record, followed by the members;
2). Each record has 6 members, one per line, with the name and value separated by a colon ":";
3). How to read multiple lines (instead of multiple fields on the same line, which is easier to parse) into the members of one record;
4). The record separator may not be preceded by a blank line.
I have included sample input and the expected output for testing.
Example: input.txt


Sample Name: CanNAM1_192
SNPs                         : 5392
MNPs                         : 0
Insertions                   : 248
Deletions                    : 359
Phased Genotypes             : 8.8% (2349/26565)
MNP Het/Hom ratio            : - (0/0)

Sample Name: CanNAM2_195
SNPs                         : 5107
MNPs                         : 0
Insertions                   : 224
Deletions                    : 351
Phased Genotypes             : 8.9% (2375/26560)
MNP Het/Hom ratio            : - (0/0)

Sample Name: CanNAM3_196
SNPs                         : 4926
MNPs                         : 0
Insertions                   : 202
Deletions                    : 332
Phased Genotypes             : 8.0% (2138/26582)
MNP Het/Hom ratio            : - (0/0)

In awk, the record separator RS and the field separator FS can be set at the beginning, but to my knowledge Python has no such built-in feature.
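Python has no built-in RS/FS, but a small generator can play the role of a record separator. The following is a minimal sketch, assuming every record starts with a "Sample Name" line and that blank lines can be ignored; `iter_records` and `marker` are hypothetical names chosen for illustration, not a standard-library API.

```python
import io

def iter_records(lines, marker="Sample Name"):
    """Yield one list of non-blank, stripped lines per record.

    A new record starts at every line that begins with `marker`
    (hypothetical parameter), so no blank line is required in between.
    """
    record = []
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(marker) and record:
            yield record                  # previous record is complete
            record = []
        record.append(line)
    if record:
        yield record                      # last record in the input

# Stand-in for the input file, with no blank line between the records.
sample = io.StringIO(
    "Sample Name: CanNAM1_192\n"
    "SNPs                         : 5392\n"
    "MNPs                         : 0\n"
    "Sample Name: CanNAM2_195\n"
    "SNPs                         : 5107\n"
    "MNPs                         : 0\n"
)

for record in iter_records(sample):
    # Field splitting (the FS part): keep the text after the first colon.
    fields = [part.split(":", 1)[1].strip() for part in record]
    print("\t".join(fields))
```

Because the generator only looks at the marker line, it does not depend on a fixed count of 7 lines per record.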

Output.tab:

CanNAM1_192  5392  0  248  359  8.8%  - 
CanNAM2_195  5107  0  224  351  8.9%  - 
CanNAM3_196  4926  0  202  332  8.0%  - 

I searched for example code that fits my case and tried the following:

import sys

filename=sys.argv[1]

Dictn = {}

with open(filename, 'r') as fh:
    for line in fh:
        while True:
            if line.startswith('Sample Name'):
                nameLine = line.strip()
                ID = nameLine.split(':')
            else:
                line2 = next(fh).strip()
                line2 = line2.split(':')
                print (line2[0], line2[1]) # For debugging to see the parsing result
                line3 = next(fh).strip().split(':')
                line4 = next(fh).strip().split(':')
                line5 = next(fh).strip().split(':')
                line6 = next(fh).strip().split(':')
                line7 = next(fh).strip().split(':')
                Dictn.update({
                            ID[1]: {
                                line2[0]: line2[1],
                                line3[0]: line3[1],
                                line4[0]: line4[1],
                                line5[0]: line5[1],
                                line6[0]: line6[1],
                                line7[0]: line7[1],
                                }
                             })
                break
print(Dictn)

Dictn.get('CanNAM1_192')
# Expected: {'SNPs': '5392', 'MNPs': '0', 'Insertions': '248', 'Deletions': '359', 'Phased Genotypes': '8.8%', 'MNP Het/Hom ratio': '-'}

I am stuck on parsing each record by reading 7 lines at a time and then adding the record to the dictionary. I am not good at Python, and I would really appreciate any help!

Yifangt

2 Answers

data = {}
with open("data.txt",'r') as fh:
    for line in fh.readlines():  # readlines() loads every line into a list first
        if len(line.strip())==0:
            continue

        if line.startswith('Sample Name'):
            nameLine = line.strip()
            name = nameLine.split(": ")[1]
            data[name] = {}
        else:
            splitLine = line.split(":")
            variableName = splitLine[0].strip()
            value = splitLine[1].strip()
            data[name][variableName] = value

print(data)
  1. Make sure the line being read is not empty. Stripping all whitespace from a blank line leaves a string of length zero, so we just check for that.
  2. If the line starts with Sample Name, we know the ID comes after a colon and a space, so we split on those characters. The ID is the second part of the split line, so we take the item at index one.
  3. Keep track of the current ID in a variable (I call it name), and create an empty nested dictionary entry for that ID.
  4. If the line is not an ID line, then it must be a data line associated with the most recently seen ID.
  5. Split the line on ":". The variable name is the first item (left of the colon) and the value is the second item (right of the colon). Strip the extra whitespace on either side of both.
  6. Add the variable/value pair to the dictionary entry for the ID.
eshanrh
  • Thanks! I overthought the problem, assuming that **7 lines as a single record** must be read in at once. One more question: my input file is very big (~30 GB), is readlines() good for that? – Yifangt Oct 17 '19 at 00:03
  • I think you have bigger problems than readlines() then. Do you have a computer that has over 30 GB of RAM? If not, the computer won't be able to hold the file in memory. I'd suggest splitting your data file up into manageable chunks. I'm not too sure about the performance of readlines() vs other methods though. – eshanrh Oct 17 '19 at 00:30
  • That's why I call the 'next()' function many times hoping to read multiple lines for a single record. – Yifangt Oct 17 '19 at 14:15
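On the file-size concern raised in these comments: iterating over the file object directly (`for line in fh`) reads one line at a time, whereas `readlines()` builds a list of every line in memory. The sketch below is one possible way to stream records straight to a tab-separated file while holding only the current record; `write_tab` and `marker` are hypothetical names, and taking only the first whitespace-separated token of each value is an assumption based on the expected output in the question.

```python
import io

def write_tab(src, dst, marker="Sample Name"):
    """Stream records from src to dst as tab-separated rows.

    Only the row currently being built is kept in memory.
    """
    row = []
    for line in src:                       # lazy: one line at a time
        line = line.strip()
        if not line:
            continue                       # ignore blank lines
        key, _, value = line.partition(":")
        if key.strip() == marker:
            if row:                        # flush the previous record
                dst.write("\t".join(row) + "\n")
            row = [value.strip()]          # start a new row with the sample ID
        elif row:
            parts = value.split()
            # Assumption: keep only the first token, e.g. "8.8%" from "8.8% (2349/26565)".
            row.append(parts[0] if parts else "")
    if row:
        dst.write("\t".join(row) + "\n")   # flush the last record

# In-memory stand-ins for the input and output files.
src = io.StringIO(
    "Sample Name: CanNAM1_192\n"
    "SNPs                         : 5392\n"
    "Phased Genotypes             : 8.8% (2349/26565)\n"
    "MNP Het/Hom ratio            : - (0/0)\n"
)
dst = io.StringIO()
write_tab(src, dst)
print(dst.getvalue(), end="")
```

With real files, the same function could be called as `with open("input.txt") as src, open("Output.tab", "w") as dst: write_tab(src, dst)`.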

I spent more time on the question and got a solution, which does NOT seem "pythonic": my code for handling the first "record" (8 lines of data, including the blank line at the bottom) duplicates the code for the rest.

import itertools
data = {}
with open("vcfstats.txt", 'r') as f:
    for line in f:
        if line.strip():                #Non blank line
            if line.startswith('Sample Name'):
                nameLine = line.strip()
                name = nameLine.split(": ")[1].strip()
                data[name] = {}
            else:
                splitLine = line.split(": ")
                variableName = splitLine[0].strip()
                values = splitLine[1].strip().split(" ")
                data[name][variableName] = values[0]        #Only take the first item as value
        else:
            continue

    for line in itertools.islice(f, 8):
        lines = (line.rstrip() for line in f)          # including blank lines
        lines = list(line for line in lines if line)   # skip blank lines

        for line in lines:
            if line.startswith('Sample Name'):
                nameLine = line.strip()
                name = nameLine.split(": ")[1].strip()
                data[name] = {}
            else:
                splitLine = line.split(": ")
                variableName = splitLine[0].strip()
                values = splitLine[1].strip().split(" ")
                data[name][variableName] = values[0]        #Only take the first item as value

What did I miss? Thanks a lot!
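One likely explanation: a file object is a one-shot iterator. Once the first `for line in f` loop reaches the end of the file, the later `itertools.islice(f, 8)` loop has nothing left to read, so its body never runs, and the first loop alone has already handled every record. A minimal demonstration, using `io.StringIO` as a stand-in for the real file:

```python
import io
import itertools

# Stand-in for an open file; behaves like a file object when iterated.
f = io.StringIO("line 1\nline 2\nline 3\n")

first_pass = [line.strip() for line in f]     # consumes every line
second_pass = list(itertools.islice(f, 8))    # iterator is already exhausted

print(first_pass)
print(second_pass)
```

If that is what happens here, the second block could simply be deleted without changing the result.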

Yifangt