Python to do data matching and merging

Question

The input file is a tab delimited unicode txt with

a  A   e  f  m
b  B   g  h
c  C   i  j
b  B   k  l

I want to match by the first and second column and merge. So I want to get

a  A   e  f  m
b  B   g  h     k  l
c  C   i  j

The code has to detect the maximum number of columns in the input. Since it is 5 in this example, "k l" were put from 6th column.

Actually I almost managed to do this using Matlab when they are all numbers. But oh, when they were letters, Matlab was so bad at handling unicode, although I read stackoverflow about how to deal with unicode in Matlab I gave up. So I now turned to python.

With excel VBA, it seemed doable but since the data size is so huge so I guessed python will be faster than Excel VBA (Am I guessing correctly?)

This is a really easy exercise in `awk`: `awk -F\t '{a=$1 "\t" $2; $1=$2=""; x[a] = x[a] $0} END {for(y in x) print y,x[y]}'` — SheetJS, Aug 10 '13 at 18:23
@Nirk According to guide of stackoverflow your answer should not be posted as comment see this http://stackoverflow.com/help/privileges/comment . — Madan Ram, Aug 10 '13 at 18:53
@MadanRam it wasn't an answer insofar as the question involved python and I didn't give a python answer. It is best laid as a comment, unless user2604484 decides that awk is an acceptable solution — SheetJS, Aug 10 '13 at 22:04
@Nirk Oh Thank you very much. I didn't know there is that type of solution. Since you are not posting this as answer because I didn't specifically asked about awk, I posted a similar question at http://stackoverflow.com/questions/18168664/awk-or-gawk-to-do-data-matching-and-merging Do you have an answer there? Thanks again. — user1849133, Aug 11 '13 at 04:13

score 0 · Answer 1 · answered Aug 10 '13 at 19:33

I would personally separate the input/processing logic from the output/formatting logic.

def match_merge(filename):
    with open(filename) as f:
        c = csv.reader(f,delimiter='\t')
        d = defaultdict(list)
        for line in c:
            d[(line[0],line[1])].append(line[2:])
    return d

Demo:

In [17]: import pprint

In [18]: pprint.pprint(match_merge('tabdelim'))
{('a', 'A'): [['e', 'f', 'm']],
 ('b', 'B'): [['g', 'h'], ['k', 'l']],
 ('c', 'C'): [['i', 'j']]}

You can take the dict (defaultdict actually) that match_merge returns and write that out to a tab-delim file easily. Since I didn't really understand your output needs, I omitted that part.

Python to do data matching and merging

1 Answers1