
I'm parsing a file and collecting stats. I want to store those stats in a nested dict whose innermost values are lists, and extend those lists as I process the file.

For instance, my dict structure is something like this:

data_dict = {
    "aa1": {'aa': [], 'bb': []},
    "aa2": {'ab': [], 'ba': []},
}

Now, as I parse the file, I want to append values to the innermost lists. For instance, after the first occurrence of data my dict should look like this:

data_dict = {
    "aa1": {'aa': ['a0'], 'bb': ['a1']},
    "aa2": {'ab': ['b0'], 'ba': ['b1']},
}

and after the second, something like this:

data_dict = {
    "aa1": {'aa': ['a0', 'a01'], 'bb': ['a1', 'a11']},
    "aa2": {'ab': ['b0', 'b01'], 'ba': ['b1', 'b11']},
}

Also, I'm not initializing the dict keys to anything; keys are created at the first occurrence of a match. Can anyone suggest how to achieve this?

Note: I'm using autovivification to initialize my data_dict, which doesn't contain anything at first.

This is the sample data I'm trying to parse:

DATETIME TYPE TAG  COUNT MEAN 1% 10% 20% 30% 40% 50% 60% 70% 80% 90% 99% 
20151109044056 LS_I aa8 57     80,493,122      8,931,000      8,937,000      8,944,000      8,974,000      9,073,000     21,262,000     28,419,000     35,794,000    148,920,000    316,408,000    447,902,000 
    20151109044056 LS_I aa0 6,893      9,008,024      8,862,000      8,913,000      8,941,000      8,964,000      8,984,000      9,006,000      9,028,000      9,049,000      9,071,000      9,102,000      9,170,000 
    20151109044056 LS_I aa1 6,062      9,018,094      8,867,000      8,913,000      8,938,000      8,961,000      8,983,000      9,003,000      9,025,000      9,048,000      9,071,000      9,103,000      9,175,000 
    20151109044056 LS_I aa2 2,776      9,030,621      8,929,000      8,967,000      8,987,000      8,999,000      9,012,000      9,024,000      9,037,000      9,050,000      9,065,000      9,087,000      9,161,000 
    20151109044056 LS_I aa3 1,074      9,028,744      8,925,000      8,970,000      8,988,000      9,002,000      9,016,000      9,026,000      9,039,000      9,051,000      9,067,000      9,089,000      9,138,000 
    20151109044056 LS_I aa4 6,060      9,003,651      8,874,000      8,935,000      8,958,000      8,976,000      8,991,000      9,005,000      9,019,000      9,033,000      9,049,000      9,071,000      9,121,000 
    20151109044056 LS_I aa5 5,453      9,003,993      8,874,000      8,936,000      8,959,000      8,976,000      8,991,000      9,004,000      9,018,000      9,032,000      9,048,000      9,071,000      9,126,000 
    20151109044056 LS_I aa6 16,384            328            111            165            190            208            227            253            301            362            434            551            997 
    20151109044056 LS_I aa7 16,384            316             58             65             70             76             87            137            308            395            512            702          1,562 

So my dict has the TAG column as the first key and one of the % columns as the second key, and the value under that key is a list of every instance of that value in the complete file.

This is my processing code, which is not working.

            while re.match("\d{14}\s.*", curr_line):

                lat_data = curr_line.split()
                tag = lat_data[header.index("TAG")]
                for item in range(len(header)):
                    col = header[item]

                    if '%' in col or\
                       "COUNT" in col or\
                       "MEAN" in col:
                        self.data_dict[tag][col].append(lat_data[item])
                curr_line = lat_file.next()
Hemant
  • More information needed. What are 1,2? Are they line numbers? when value will be added to 'a' and when to 'b'? – Ahsanul Haque Nov 10 '15 at 11:33
  • no, they are just key. I used as dummy variable. I made changes to make it more clear – Hemant Nov 10 '15 at 11:34
  • See http://stackoverflow.com/questions/4143698/create-or-append-to-a-list-in-a-dictionary-can-this-be-shortened or http://stackoverflow.com/questions/960733/python-creating-a-dictionary-of-lists on how to either use a defaultdict or the setdefault method – nos Nov 10 '15 at 11:39
  • As it stands, your example is not valid Python. I'm guessing that you meant `"aa1"`, not `aa1`, for example, and that each value is a single dictionary, not a list of dictionaries. Right? Could you clarify? That said, you probably want to take a look at [`collections.defaultdict`](http://docs.python.org/3/library/collections.html#collections.defaultdict). – Tim Pietzcker Nov 10 '15 at 11:40
  • @TimPietzcker I'm not sure how a defaultdict will be able to help, as my dict is a nested dict. – Hemant Nov 10 '15 at 11:49
  • The sample data does not cover everything in the attached dictionary structure; again, it is hard to guess the structure of the sample data. – Learner Nov 10 '15 at 12:28

1 Answer

First off: `has_key` has been deprecated for ages (gone in Py3); use direct `in` checks instead. Secondly, what you were trying to do with `has_key` doesn't do what you intended: `[tag][col]` has nothing to look the keys up in, so Python reads it as a one-element list literal `[tag]` indexed by `col` rather than as a nested lookup. The fix is to test for each component individually (after which you can append, since you know the value exists):

if tag in self.data_dict and col in self.data_dict[tag]:
    self.data_dict[tag][col].append(whatever_you_want_to_append)
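
A minimal sketch of that check on a plain dict, next to the `setdefault` variant mentioned in the comments for creating keys on first occurrence. The names `append_if_present` and `append_or_create` are hypothetical helpers, and `data_dict` here is a plain `dict` standing in for `self.data_dict`, not the autovivifying class from the question:

```python
# Hypothetical stand-in for self.data_dict; a plain dict is assumed here.
data_dict = {"aa1": {"50%": ["9,003,000"]}}


def append_if_present(data, tag, col, value):
    """Append only when both levels already exist; report whether it happened."""
    if tag in data and col in data[tag]:
        data[tag][col].append(value)
        return True
    return False


def append_or_create(data, tag, col, value):
    """Create any missing level on first occurrence, then append."""
    data.setdefault(tag, {}).setdefault(col, []).append(value)


appended = append_if_present(data_dict, "aa1", "50%", "9,005,000")  # keys exist
skipped = append_if_present(data_dict, "aa2", "50%", "9,024,000")   # "aa2" missing
append_or_create(data_dict, "aa2", "50%", "9,024,000")              # creates "aa2"
```

Note that the pure membership test never appends anything if the keys are never created, so with an initially empty dict something like the second helper is probably what you want.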

Side-note: You almost never want `for i in range(len(something)):`; that's a symptom of coming from a C-style for loop background. You're not actually using the index for anything besides getting the value, so replace:

for item in range(len(header)):
    col = header[item]

with:

for col in header:

This runs faster and is more idiomatic. If you need the index too for some reason, that's what `enumerate` is for:

for i, col in enumerate(header):
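
As a rough illustration of the `enumerate` form, here is a sketch that uses the index to pick the matching field out of a row. The header and row are shortened from the sample data in the question (only the first two percentile columns are kept), and `picked` is a hypothetical name:

```python
# Header and first data row, truncated to seven columns for brevity.
header = "DATETIME TYPE TAG COUNT MEAN 1% 10%".split()
lat_data = "20151109044056 LS_I aa8 57 80,493,122 8,931,000 8,937,000".split()

picked = []
for i, col in enumerate(header):
    # Keep only the statistic columns; i indexes the matching field in the row.
    if "%" in col or col in ("COUNT", "MEAN"):
        picked.append((col, lat_data[i]))
```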

UPDATE: You updated the question with more info, so it looks like you need to iterate `lat_data` in parallel with `header`. In that case, do:

for col, lat in zip(header, lat_data):
    ...

        if tag in self.data_dict and col in self.data_dict[tag]:
            self.data_dict[tag][col].append(lat)
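
Putting the pieces together, one possible end-to-end sketch. It assumes a nested `collections.defaultdict` (as suggested in the comments) in place of the question's autovivification class, so the inner lists are created on first access and no membership test is needed. The `lines` list is a hypothetical stand-in for the file: a shortened header, two rows taken from the sample data, and the first row repeated to play the role of a later sample:

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the file contents (columns truncated for brevity).
lines = [
    "DATETIME TYPE TAG COUNT MEAN 1% 10%",
    "20151109044056 LS_I aa8 57 80,493,122 8,931,000 8,937,000",
    "    20151109044056 LS_I aa0 6,893 9,008,024 8,862,000 8,913,000",
    # Same aa8 row repeated to stand in for a second occurrence of the tag.
    "20151109044056 LS_I aa8 57 80,493,122 8,931,000 8,937,000",
]

# tag -> column -> list of values; both levels are created on first access.
data_dict = defaultdict(lambda: defaultdict(list))

header = lines[0].split()
tag_index = header.index("TAG")

for line in lines[1:]:
    # Only process rows that start with a 14-digit timestamp.
    if not re.match(r"\d{14}\s", line.strip()):
        continue
    fields = line.split()
    tag = fields[tag_index]
    for col, value in zip(header, fields):
        if "%" in col or col in ("COUNT", "MEAN"):
            data_dict[tag][col].append(value)
```

With this layout, `data_dict["aa8"]["1%"]` collects one entry per matching row. One thing to keep in mind is that merely reading a missing key from a `defaultdict` creates it, so use `in` when you only want to test for presence.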
ShadowRanger