Joining all rows of a CSV file that have the same 1st column value in Python

Question

I have a CSV file that goes something like this:

['Name1', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '+']
['Name1', '', '', '', '', '', 'b', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
['Name2', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'a', '']
['Name3', '', '', '', '', '+', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

Now, I need a way to join all of the rows that have the same 1st column name into one row, for instance:

['Name1', '', '', '', '', '', 'b', '', '', '', '', '', '', '', '', '', '', '', '', '', '+']
['Name2', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'a', '']
['Name3', '', '', '', '', '+', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

I can think of a way to do this by sorting the CSV and then going through each row and column and comparing each value, but there is probably an easier way to do it.

Any ideas?

you should probably be more explicit on what _join_ should do. — moooeeeep, Jun 14 '12 at 11:35
Can the same column be present in two rows with the same first value? What do you want to do in that case? — Charles Brunet, Jun 14 '12 at 11:37
@moooeeeep: Well, I want to join them so that they are like in the second part of the example. — jbssm, Jun 14 '12 at 11:39
@CharlesBrunet: No, for the same name a value can only appear in one of the other columns once for each column. — jbssm, Jun 14 '12 at 11:41

moooeeeep · Accepted Answer · 2012-06-14T12:26:03.140

score 3

You should use itertools.groupby:

t = [ 
['Name1', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '+'],
['Name1', '', '', '', '', '', 'b', '', '', '', '', '', '', '', '', '', '', '', '', '', ''],
['Name2', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'a', ''],
['Name3', '', '', '', '', '+', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''] 
]

from itertools import groupby

# groupby only groups adjacent rows, so the table is sorted first.
# To speed things up, operator.itemgetter(0) can be used as the key
# for both sorting and grouping.
# join_rows is defined below.
for name, rows in groupby(sorted(t), lambda x: x[0]):
    print(join_rows(rows))

The merging itself belongs in a separate function, which you write yourself. For example:

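def join_rows(rows):
    def join_tuple(tup):
        # Return the first non-empty value in this column, or '' if all are empty.
        for x in tup:
            if x:
                return x
        return ''
    # zip(*rows) transposes the rows into columns.
    return [join_tuple(x) for x in zip(*rows)]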
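
Putting the pieces together for Python 3, with the `operator.itemgetter` key mentioned in the comment above the loop, might look like the sketch below. The function name `merge_by_first_column` is made up for this illustration, the sample table is shortened to four columns, and the code assumes (as the asker confirmed) that for a given name at most one row has a value in any column.

```python
from itertools import groupby
from operator import itemgetter


def join_rows(rows):
    # For each column, keep the first non-empty value, or '' if there is none.
    return [next((x for x in column if x), '') for column in zip(*rows)]


def merge_by_first_column(table):
    # Hypothetical helper name. groupby only groups adjacent rows,
    # so the table is sorted on the first column before grouping.
    first = itemgetter(0)
    ordered = sorted(table, key=first)
    return [join_rows(group) for _, group in groupby(ordered, key=first)]


# Shortened sample data; rows with the same name need not be adjacent.
t = [
    ['Name1', '', 'b', ''],
    ['Name2', '', '', 'a'],
    ['Name1', '', '', '+'],
]
merged = merge_by_first_column(t)
print(merged)
```

Sorting on the first column only keeps the original order of rows within each name, because Python's sort is stable.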

edited Jun 14 '12 at 12:26

answered Jun 14 '12 at 11:43

moooeeeep


It doesn't work. Is join_rows a function from some lib, or something I must write separately from the code? – jbssm Jun 14 '12 at 11:50
@jbssm the `join_rows` is an entrypoint for your code, it is for you to write ;) – schlamar Jun 14 '12 at 11:58
@moooeeeep do not use `sorted` without a key; it adds unnecessary runtime. – schlamar Jun 14 '12 at 12:01
@moooeeeep `itemgetter(0)` would be a better approach (see http://stackoverflow.com/a/4174955/851737) – schlamar Jun 14 '12 at 12:02
So it will look like: `for name, rows in groupby(sorted(t, key=itemgetter(0)), itemgetter(0))` – schlamar Jun 14 '12 at 12:03
@ms4py thanks for the note, I added a note in my answer! (the code is indeed not optimized for performance, but for verbosity) – moooeeeep Jun 14 '12 at 12:10

Simeon Visser · Answer 2 · 2012-06-14T11:48:43.060

score 1

def merge_rows(row1, row2):
    # merge two rows with the same name
    merged_row = ...
    return merged_row

r1 = ['Name1', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '+']
r2 = ['Name1', '', '', '', '', '', 'b', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
r3 = ['Name2', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'a', '']
r4 = ['Name3', '', '', '', '', '+', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
rows = [r1, r2, r3, r4]
data = {}
for row in rows:
    name = row[0]
    if name in data:
        data[name] = merge_rows(row, data[name])
    else:
        data[name] = row

You now have all the rows in data where each key of this dictionary is the name and the corresponding value is that row. You can now write this data to a CSV file.
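
One possible way to fill in `merge_rows`, and to write the result out, is sketched here. It assumes both rows have the same length and that a non-empty cell should win over an empty one. The `merge_all` wrapper is a hypothetical name added only for illustration, the sample rows are shortened, and the in-memory `io.StringIO` buffer stands in for a real output file.

```python
import csv
import io


def merge_rows(row1, row2):
    # Assumption: per column, take whichever cell is non-empty
    # (row1 wins if both are filled).
    return [a or b for a, b in zip(row1, row2)]


def merge_all(rows):
    # Same loop as in the answer, wrapped in a function (hypothetical name).
    data = {}
    for row in rows:
        name = row[0]
        data[name] = merge_rows(row, data[name]) if name in data else row
    return data


# Shortened sample data.
rows = [
    ['Name1', '', '', '+'],
    ['Name1', '', 'b', ''],
    ['Name2', 'a', '', ''],
]
data = merge_all(rows)

# Write the merged rows as CSV. To write a real file, replace the buffer
# with open('out.csv', 'w', newline='').
buffer = io.StringIO()
csv.writer(buffer).writerows(data.values())
print(buffer.getvalue())
```

Because dictionaries keep insertion order in Python 3.7 and later, the names come out in the order they first appear in the input.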

edited Jun 14 '12 at 11:48

answered Jun 14 '12 at 11:25

Simeon Visser


Hi and thanks Simeon: I don't understand what is going on in the merged_row part. Where is the previous row (or rows) with the same name stored so that I can merge them? – jbssm Jun 14 '12 at 11:38
The current row that you're processing is `row` and the other is `data[name]`. The row in `data[name]` is either a previous row with that name or the result of one or more merges of rows with that name. So you only need to write the code that specifies how to merge two rows with the same name. If you write that code for `merged_row` then it'll repeatedly merge rows (even if there are three or more rows with the same name). – Simeon Visser Jun 14 '12 at 11:42
I have updated the code to make it a bit clearer. All you need to do is write `merge_rows` to specify how two rows with the same name need to be merged. – Simeon Visser Jun 14 '12 at 11:49

score 0 · Answer 3 · answered Jun 14 '12 at 12:38

You can also use defaultdict:

>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> _ = [d[i[0]].append(z) for i in t for z in i[1:]]
>>> d['Name1']
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '+', '', '', '', '', '', 'b', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

Then do your column joining.
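
The flat list built above loses the row boundaries, so the column joining would first have to split it again. A variant that stores whole rows per name may be easier to join. This is only a sketch: the sample `t` is shortened to four columns, and it assumes at most one non-empty value per column for each name.

```python
from collections import defaultdict

# Shortened sample data in the same layout as `t` in the accepted answer.
t = [
    ['Name1', '', '', '+'],
    ['Name1', '', 'b', ''],
    ['Name2', 'a', '', ''],
]

# Collect the rows (minus the name) under their first-column value.
d = defaultdict(list)
for row in t:
    d[row[0]].append(row[1:])

# Join column-wise: keep the first non-empty cell in each column.
joined = [
    [name] + [next((cell for cell in column if cell), '') for column in zip(*rows)]
    for name, rows in d.items()
]
print(joined)
```

A plain `for` loop is used to fill the dictionary instead of a list comprehension, since the comprehension in the session above is run only for its side effects.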
