
In my data:

myData='''pos\tidx1\tval1\tidx2\tval2
11\t4\tC\t6\tA
15\t4\tA\t6\tT
23\t4\tT\t6\tT
28\t4\tA\t3\tG
34\t4\tG\t3\tC
41\t4\tC\t4\tT
51\t4\tC\t4\tC'''

I read this data with the header fields as keys using csv.DictReader:

import csv
import io
import itertools

input_file = csv.DictReader(io.StringIO(myData), delimiter='\t')
# which produces an iterator

''' Now, I want to group these rows by idx2, where each idx2 value
becomes the main key and the remaining fields are merged into lists
keyed by their common field names. '''

# This groupby call gives me
file_blocks = itertools.groupby(input_file, key=lambda x: x['idx2'])

# I can print this as
for index, blocks in file_blocks:
    print(index, list(blocks))

6 [{'val2': 'A', 'val1': 'C', 'idx1': '4', 'pos': '11', 'idx2': '6'}, {'val2': 'T', 'val1': 'A', 'idx1': '4', 'pos': '15', 'idx2': '6'}, {'val2': 'T', 'val1': 'T', 'idx1': '4', 'pos': '23', 'idx2': '6'}]
3 [{'val2': 'G', 'val1': 'A', 'idx1': '4', 'pos': '28', 'idx2': '3'}, {'val2': 'C', 'val1': 'G', 'idx1': '4', 'pos': '34', 'idx2': '3'}]
4 [{'val2': 'T', 'val1': 'C', 'idx1': '4', 'pos': '41', 'idx2': '4'}, {'val2': 'C', 'val1': 'C', 'idx1': '4', 'pos': '51', 'idx2': '4'}]

But since the groupby output is an iterator that is exhausted after a single pass, I can't print it or use it more than once while debugging.
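
For example, looping over the same file_blocks object a second time after the loop above prints nothing at all:

# file_blocks was already consumed by the first loop,
# so this second pass produces no output
for index, blocks in file_blocks:
    print(index, list(blocks))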

So, problem #1: how do I convert it into a non-iterator data structure?

Problem #2: how can I process this groupby object further to merge the values into lists keyed by the common field names within each group/block?

Something like an OrderedDict or defaultdict, where the order in which the data is read is preserved:

{'6': defaultdict(<class 'list'>, {'pos': [11, 15, 23], 'idx1': [4, 4, 4], 'val1': ['C', 'A', 'T'], 'idx2': [6, 6, 6], 'val2': ['A', 'T', 'T']})}
{'3': .....
{'4': .....

Some of the fixes I tried:

I first thought I could build a key: [values] mapping over the unique keys before grouping:

update_dict = {}
for lines in input_file:
    print(type(lines))
    for k, v in lines:                     # fails: iterating a dict yields only keys,
        update_dict['idx2'] = lines[k, v]  # so the k, v unpacking raises ValueError

The other thing I tried was to see whether I could merge the data inside the grouped object:

new_groupBy = {}
for index, blocks in file_blocks:
    print(index, list(blocks))
    for x in blocks:
        for k, v in x:
            ...  # do something for new_groupBy

everestial007

2 Answers


So, as for your first problem, you can simply materialize the iterator into a list:

In [9]: raw_data='''pos\tidx1\tval1\tidx2\tval2
    ...: 11\t4\tC\t6\tA
    ...: 15\t4\tA\t6\tT
    ...: 23\t4\tT\t6\tT
    ...: 28\t4\tA\t3\tG
    ...: 34\t4\tG\t3\tC
    ...: 41\t4\tC\t4\tT
    ...: 51\t4\tC\t4\tC'''

In [10]: data_stream = csv.DictReader(io.StringIO(raw_data), delimiter="\t")

In [11]: grouped = itertools.groupby(data_stream, key=lambda x:x['idx2'])

In [12]: data = [(k,list(g)) for k,g in grouped] # order is important, so use a list

In [13]: data
Out[13]:
[('6',
  [{'idx1': '4', 'idx2': '6', 'pos': '11', 'val1': 'C', 'val2': 'A'},
   {'idx1': '4', 'idx2': '6', 'pos': '15', 'val1': 'A', 'val2': 'T'},
   {'idx1': '4', 'idx2': '6', 'pos': '23', 'val1': 'T', 'val2': 'T'}]),
 ('3',
  [{'idx1': '4', 'idx2': '3', 'pos': '28', 'val1': 'A', 'val2': 'G'},
   {'idx1': '4', 'idx2': '3', 'pos': '34', 'val1': 'G', 'val2': 'C'}]),
 ('4',
  [{'idx1': '4', 'idx2': '4', 'pos': '41', 'val1': 'C', 'val2': 'T'},
   {'idx1': '4', 'idx2': '4', 'pos': '51', 'val1': 'C', 'val2': 'C'}])]

As for your second problem, try something like:

In [15]: import collections

In [16]: def accumulate(data):
    ...:     acc = collections.OrderedDict()
    ...:     for d in data:
    ...:         for k,v in d.items():
    ...:             acc.setdefault(k,[]).append(v)
    ...:     return acc
    ...:

In [17]: grouped_data = {k:accumulate(d) for k,d in data}

In [18]: grouped_data
Out[18]:
{'3': OrderedDict([('pos', ['28', '34']),
              ('idx2', ['3', '3']),
              ('val2', ['G', 'C']),
              ('val1', ['A', 'G']),
              ('idx1', ['4', '4'])]),
 '4': OrderedDict([('pos', ['41', '51']),
              ('idx2', ['4', '4']),
              ('val2', ['T', 'C']),
              ('val1', ['C', 'C']),
              ('idx1', ['4', '4'])]),
 '6': OrderedDict([('pos', ['11', '15', '23']),
              ('idx2', ['6', '6', '6']),
              ('val2', ['A', 'T', 'T']),
              ('val1', ['C', 'A', 'T']),
              ('idx1', ['4', '4', '4'])])}

Note, I used list (and dict) comprehensions. They work similarly. The list comprehension is equivalent to:

data = []
for k, g in grouped:
    data.append((k, list(g)))

And for good measure, here's the equivalent of the dict comprehension, although I'm using an OrderedDict, since order seems to be important in any case:

In [20]: grouped_data = collections.OrderedDict()

In [21]: for k, d in data:
    ...:     grouped_data[k] = accumulate(d)
    ...:

In [22]: grouped_data
Out[22]:
OrderedDict([('6',
              OrderedDict([('val2', ['A', 'T', 'T']),
                           ('val1', ['C', 'A', 'T']),
                           ('pos', ['11', '15', '23']),
                           ('idx2', ['6', '6', '6']),
                           ('idx1', ['4', '4', '4'])])),
             ('3',
              OrderedDict([('val2', ['G', 'C']),
                           ('val1', ['A', 'G']),
                           ('pos', ['28', '34']),
                           ('idx2', ['3', '3']),
                           ('idx1', ['4', '4'])])),
             ('4',
              OrderedDict([('val2', ['T', 'C']),
                           ('val1', ['C', 'C']),
                           ('pos', ['41', '51']),
                           ('idx2', ['4', '4']),
                           ('idx1', ['4', '4'])]))])

Note, we can do everything in a single pass, avoiding the creation of unnecessary intermediate data structures:

import itertools, io, csv, collections

data_stream = csv.DictReader(io.StringIO(raw_data), delimiter="\t")
grouped = itertools.groupby(data_stream, key=lambda x:x['idx2'])

def accumulate(data):
    acc = collections.OrderedDict()
    for d in data:
        for k,v in d.items():
            acc.setdefault(k,[]).append(v)
    return acc

grouped_data = collections.OrderedDict()
for k, g in grouped:
    grouped_data[k] = accumulate(g)
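
Assuming the variables from the run above, you can then pull a single merged column straight out of the nested mapping, for example:

print(grouped_data['6']['pos'])   # ['11', '15', '23']
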
juanpa.arrivillaga
  • thanks @juanpa. I was now trying to access the data (grouped_data) in a for loop: `for x, y in grouped_data: print(x)`. But I get this error message: `for x, y in grouped_data: ValueError: not enough values to unpack (expected 2, got 1)` – everestial007 Jan 10 '18 at 23:49
  • @everestial007 You are asking how to iterate over a dictionary (`OrderedDict` objects *are* `dict` objects) see [this question](https://stackoverflow.com/questions/3294889/iterating-over-dictionaries-using-for-loops) – juanpa.arrivillaga Jan 10 '18 at 23:51
  • Now, from the whole ordered dict, I am trying to access nested `OrderedDict` and the `keys:values` for further processing. – everestial007 Jan 10 '18 at 23:51
  • @everestial007 they are *dictionaries*. Consider `grouped_data['6']['pos']` – juanpa.arrivillaga Jan 10 '18 at 23:52
  • That worked, but I don't seem to understand the output. `for k, v in grouped_data['6']['pos']: print(k)` outputs: 1 1 2. I don't see what it is outputting. – everestial007 Jan 10 '18 at 23:58
  • @everestial007 look, I'm not trying to be mean, but do you understand how dictionaries work? I'm pretty sure that expression will error out. `grouped_data['6']['pos']` returns a *list* that is from index `'6'` and is the grouped `'pos'` fields... – juanpa.arrivillaga Jan 10 '18 at 23:59
  • I only vaguely understand how dictionaries work. What you have done just flew over my head; I will try to comprehend it. But I don't see where that `1 1 2` is coming from. There are no such values in index '6'. Anyway, it's fine. – everestial007 Jan 11 '18 at 00:03
  • @everestial007 try `print(grouped_data['6'])` then try `print(grouped_data['6']['pos'])` – juanpa.arrivillaga Jan 11 '18 at 00:13
  • @pylang: Thanks for being nice. Well, I got that this time. Also, I realized that I can iterate over it using `for k, v in grouped_data.items(): print(k, v)` – everestial007 Jan 11 '18 at 00:28

Given

import io
import csv
import itertools as it
import collections as ct    

data="""pos\tidx1\tval1\tidx2\tval2
11\t4\tC\t6\tA
15\t4\tA\t6\tT
23\t4\tT\t6\tT
28\t4\tA\t3\tG
34\t4\tG\t3\tC
41\t4\tC\t4\tT
51\t4\tC\t4\tC"""

Part I

how do I convert it into a non-iterator data structure

Code

Here's how to retain data from the iterator - simply convert it to a list:

>>> input_file = list(csv.DictReader(io.StringIO(data), delimiter = "\t"))
>>> input_file
[{'idx1': '4', 'idx2': '6', 'pos': '11', 'val1': 'C', 'val2': 'A'},
 {'idx1': '4', 'idx2': '6', 'pos': '15', 'val1': 'A', 'val2': 'T'},
 {'idx1': '4', 'idx2': '6', 'pos': '23', 'val1': 'T', 'val2': 'T'},
 {'idx1': '4', 'idx2': '3', 'pos': '28', 'val1': 'A', 'val2': 'G'},
 {'idx1': '4', 'idx2': '3', 'pos': '34', 'val1': 'G', 'val2': 'C'},
 {'idx1': '4', 'idx2': '4', 'pos': '41', 'val1': 'C', 'val2': 'T'},
 {'idx1': '4', 'idx2': '4', 'pos': '51', 'val1': 'C', 'val2': 'C'}]

Or use a list comprehension:

>>> file_blocks = [(k, list(g)) for k, g in it.groupby(input_file, key=lambda x: x["idx2"])]
>>> file_blocks
[('6',
  [{'idx1': '4', 'idx2': '6', 'pos': '11', 'val1': 'C', 'val2': 'A'},
   {'idx1': '4', 'idx2': '6', 'pos': '15', 'val1': 'A', 'val2': 'T'},
   {'idx1': '4', 'idx2': '6', 'pos': '23', 'val1': 'T', 'val2': 'T'}]),
 ('3',
  [{'idx1': '4', 'idx2': '3', 'pos': '28', 'val1': 'A', 'val2': 'G'},
   {'idx1': '4', 'idx2': '3', 'pos': '34', 'val1': 'G', 'val2': 'C'}]),
 ('4',
  [{'idx1': '4', 'idx2': '4', 'pos': '41', 'val1': 'C', 'val2': 'T'},
   {'idx1': '4', 'idx2': '4', 'pos': '51', 'val1': 'C', 'val2': 'C'}])]

Now you can reuse data from input_file and file_blocks.
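
For example, iterating over file_blocks a second time now works, because it is a plain list rather than a spent iterator:

>>> [k for k, g in file_blocks]
['6', '3', '4']
>>> [k for k, g in file_blocks]  # a second pass still sees every group
['6', '3', '4']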


Part II

how can I process this groupby object further to merge the values into lists keyed by the common field names within each group/block...

Something like an OrderedDict or defaultdict, where the order in which the data is read is preserved

def collate_data(data):
    """Yield an OrderedDict of merged dictionaries from `data`."""
    for idx, item in data:
        results = ct.OrderedDict()
        dd = ct.defaultdict(list)
        for dict_ in item:
            for k, v in dict_.items():
                dd[k].append(v)
        results[idx] = dd
        yield results
    

list(collate_data(file_blocks))

Output

[OrderedDict([('6',
               defaultdict(list,
                           {'idx1': ['4', '4', '4'],
                            'idx2': ['6', '6', '6'],
                            'pos': ['11', '15', '23'],
                            'val1': ['C', 'A', 'T'],
                            'val2': ['A', 'T', 'T']}))]),
 OrderedDict([('3',
               defaultdict(list,
                           {'idx1': ['4', '4'],
                            'idx2': ['3', '3'],
                            'pos': ['28', '34'],
                            'val1': ['A', 'G'],
                            'val2': ['G', 'C']}))]),
 OrderedDict([('4',
               defaultdict(list,
                           {'idx1': ['4', '4'],
                            'idx2': ['4', '4'],
                            'pos': ['41', '51'],
                            'val1': ['C', 'C'],
                            'val2': ['T', 'C']}))])]

The order of the itertools.groupby() groups is preserved because collate_data() yields one collections.OrderedDict per group, in the order the groups appear. The order of values across the lines of the file (see the dicts in input_file) is preserved by the lists inside each collections.defaultdict() object.
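
If you would rather end up with a single mapping keyed by idx2 (closer to the output sketched in the question) than a list of one-key OrderedDicts, one option is to merge the yielded dictionaries; merged below is just an illustrative name:

merged = ct.OrderedDict()
for group in collate_data(file_blocks):
    merged.update(group)          # each yielded OrderedDict holds a single idx2 key

merged['6']['pos']
# ['11', '15', '23']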

pylang