
I have several tab-delimited files that I want to read into dicts using csv.DictReader. Each file contains several comment lines starting with '#' or '\t' before the actual data begins. The number of comment lines varies between files. I've been trying the methods outlined in this post but cannot get them to work.

Here is my current code:

def load_database_snps(inputFile):
    '''This function takes a txt tab delimited input file (in house database) and returns a list of dictionaries for each variant'''
    idStore = [] #empty list for storing variant records
    with open(inputFile, 'r+') as varin:
        idStoreDictgroup = csv.DictReader((row for row in varin if row.startswith('hr', 1, 2)),delimiter='\t') #create a generator; dictionary per snp (row) in the file
        idStoreDictgroup.fieldnames = [field.strip() for field in idStoreDictgroup.fieldnames] #strip whitespace from field names
        print(type(idStoreDictgroup))
        for d in idStoreDictgroup: #iterate over dictionaries in varin_dictgroup
            print(d)
            idStore.append(d) #attach to var_list
    return idStore

Here is an example of an input file:

## SM=Sample,AD=Total Allele Depth, DP=Total Depth
## het;;; and homo;;; are breakdowns of variant read counts per sample - chr1:10002921 T>G AD=34 het:4;11;7;12 (sum=34)


        Hetereozygous                                       Homozygous                                      
    Chr     Start      End            ref           |A|     |C|     |G|     |T|     HetCount        |A|     |C|     |G|     |T|     HomCount        TotalCount      SampleCount
    chr1    10001102        10001102        T       0       0       SM=1;AD=22;DP=38        0       1       0       0       0       0       0       1       138     het:22; homo:-  
    chr1    10002921        10002921        T       0       0       SM=4;AD=34;DP=63        0       4       0       0       0       0       0       4       138     het:4;11;7;12;  homo:-

The lines I want to read in all begin with 'Chr' or 'chr'. I think it's not working because I need to iterate over the reader to reformat the field names, and doing so exhausts the generator before the rows can be read into the dictionaries.

The error message I get is:

Traceback (most recent call last):
  File "snp_freq_V1-1_export.py", line 99, in <module>
    snp_check_wrapper(inputargs.snpstocheck, inputargs.snp_database_location)
  File "snp_freq_V1-1_export.py", line 92, in snp_check_wrapper
    snpDatabase = load_database_snps(databaseInputFile) #store database variants in snp_database (a dictionary)
  File "snp_freq_V1-1_export.py", line 53, in load_database_snps
    idStoreDictgroup.fieldnames = [field.strip() for field in idStoreDictgroup.fieldnames] #strip whitespace from field names
TypeError: 'NoneType' object is not iterable

I have also tried the inverse of my current code, explicitly excluding rows that start with '#' or '\t', but this didn't work either and just gives me a blank dictionary.
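
The TypeError can be reproduced in isolation by running the filter on its own. The sketch below uses a made-up two-line sample rather than the real database file (the `sample` string and the variable names are assumptions for illustration). It suggests the filter never lets any line through, so the reader has no header row to read field names from:

```python
import csv
import io

# Hypothetical two-line sample standing in for the real database file.
sample = "Chr\tStart\tEnd\nchr1\t10001102\t10001102\n"

# startswith(prefix, start, end) compares against the slice row[1:2], which is
# a single character, so a two-character prefix such as 'hr' can never match.
kept = [row for row in io.StringIO(sample) if row.startswith('hr', 1, 2)]

# With every row filtered out, DictReader has no header line to read,
# so its fieldnames attribute stays None.
reader = csv.DictReader(iter(kept), delimiter='\t')
header = reader.fieldnames
```

Here `kept` is empty and `header` is `None`, which is what the list comprehension over `fieldnames` then fails on. Widening the slice would not be enough for the file shown above either, because its header and data lines start with whitespace.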

s_boardman
  • Is there only one per file? eg... the above comments/headers won't be repeated more than once per file? – Jon Clements Feb 06 '14 at 14:55
  • Yes, so from the example file I want it to use the row Chr Start... as the header and all the subsequent rows as the values for my dictionaries. – s_boardman Feb 06 '14 at 15:47

1 Answer

What you should be able to do is skip all the preceding lines until you reach one that starts with chr (ignoring case and leading whitespace), such as:

import csv
from itertools import dropwhile

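with open('somefile', newline='') as fin:
    # drop lines until the first one that starts with 'chr' (any case, leading whitespace ignored)
    start = dropwhile(lambda L: not L.lower().lstrip().startswith('chr'), fin)
    for row in csv.DictReader(start, delimiter='\t'):
        print(row)  # do something with each row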
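
Applied to the function from the question, the same idea might look like the sketch below. It is an illustration under assumptions, not a tested drop-in: the helper `is_preamble` and the variable names are made up here, and it assumes the header row is the first line whose first non-blank text is `chr` in any case.

```python
import csv
from itertools import dropwhile


def is_preamble(line):
    """Return True for lines that come before the 'Chr' header row."""
    return not line.lstrip().lower().startswith('chr')


def load_database_snps(input_file):
    """Return one dict per data row of a tab-delimited variant file."""
    with open(input_file, newline='') as varin:
        # Skip everything up to, but not including, the header row.
        rows = dropwhile(is_preamble, varin)
        reader = csv.DictReader(rows, delimiter='\t')
        records = []
        for row in reader:
            # Strip padding from column names and drop unnamed columns,
            # for example one created by a leading tab on each line.
            records.append({key.strip(): value
                            for key, value in row.items()
                            if key and key.strip()})
        return records
```

Stripping the keys per row avoids touching `fieldnames` before anything has been read. One caveat for the example file: its header repeats `|A|`, `|C|`, `|G|` and `|T|`, and because each row becomes a dict, only the last column with a given name survives. If both the heterozygous and homozygous counts are needed, the columns would have to be renamed first.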
Jon Clements