Concatenate values from the same first column in one line

Question

I have a file like this:

A2ML1,ENST00000541459
A2ML1,ENST00000545692
A2ML1,ENST00000545850
A3GALT2,ENST00000442999
A4GALT,ENST00000249005
A4GALT,ENST00000381278

And I want to group the lines like this:

A2ML1,ENST00000541459,ENST00000545692,ENST00000545850
A3GALT2,ENST00000442999
A4GALT,ENST00000249005,ENST00000381278

Here is my code in python, that is leaving the file as original XD:

import sys

with open('gene_list.csv', 'r') as file_open:
    iterfile = iter(file_open)
    for line in iterfile:
        l = line.split(",")
        select = l[0]
        linext = iterfile.next()
        linext2 = linext.split(",")
        if select == linext2[0]:
            sys.stdout.write(select + ',' + linext2[1])
            next(file_open)
        else:
            sys.stdout.write(select + ',' + l[1])

I know that it is very easy to do but I am stuck with this. I really apreciate your help. Thanks!

Could you not use a dictionary with the key being the value from the first column, and the value being a list? — Ben, Jan 25 '17 at 08:00
Hi @Ben, that's a good idea but I have thousands of values. Maybe I should do a list of all values and iterate over them? Thanks! — JaimeG, Jan 25 '17 at 08:10

Jamie Phan · Accepted Answer · 2017-01-25T08:34:41.427

2

Hope this helps :)

import csv
import collections

#Read in the data as a dictionary
with open('gene_list.csv', 'r') as fd:

    reader = csv.reader(fd)

    #If you have headers in the CSV file you want to skip
    #next(reader, None)

    #This dict will have key:value, value=list type
    unique_first_col = collections.defaultdict(list)
    for row in reader:
        unique_first_col[row[0]].append(row[1])

with open('output.csv', 'w') as fd:

    #Sorted dictionary
    sorted_d = collections.OrderedDict(sorted(unique_first_col.items()))
    for k, v in sorted_d.items():
        fd.write("%s, %s\n" % (k, ','.join(v)))

Notes:

see this question for collections.defaultdict
See the documentation for information on CSV processing
You may want to consider using simple string 'pre-processing' such as strip() before you key into the dictionary, as a trailing space may cause a key to be entered as a new key.
See this question for sorting dictionaries

edited Jan 25 '17 at 08:34

answered Jan 25 '17 at 08:03

Jamie Phan

1,112
7
15

2

using `collections.defaultdict(list)` would make things even simpler... – bruno desthuilliers Jan 25 '17 at 08:05
You're right! Thanks! I'm terrible at remembering the simple methods of doing things (there's so many) – Jamie Phan Jan 25 '17 at 08:08
Thank you very much @K.J.Phan! It worked! The only thing was that the first columns were not sorted as before, but all ok after using bash. Thanks! – JaimeG Jan 25 '17 at 08:23

score 0 · Answer 2 · answered Jan 25 '17 at 08:24

0

If you need to try pandas you can do this way:-

import pandas as pd
df = pd.read_csv("gene_list.csv", header=None)
df.columns = ["First", "Second"]
df.groupby("First")["Second"].agg({"Second":lambda x:", ".join(x.astype(str))})

answered Jan 25 '17 at 08:24

Rakesh Kumar

4,319
2
17
30

score 0 · Answer 3 · answered Jan 25 '17 at 08:29

The simple solution is to use the first value as a dictionary key. It is not strictly necessary to use defaultdict, but it makes constructing your list of secondary values easier.

from collections import defaultdict

merged = defaultdict(list)

with open('gene_list.csv', 'r') as f:
    for raw_line in f:
        line = raw_line.strip()
        first, second = line.split(',')
        merged[first].append(second)

for key, values in merged:
    print(key + ',' + ','.join(values))

If you have to assume that your original file can have more than two key, value pairs then you need to tweak this script a little.

Concatenate values from the same first column in one line

3 Answers3