0

I have a file like this:

A2ML1,ENST00000541459
A2ML1,ENST00000545692
A2ML1,ENST00000545850
A3GALT2,ENST00000442999
A4GALT,ENST00000249005
A4GALT,ENST00000381278

And I want to group the lines like this:

A2ML1,ENST00000541459,ENST00000545692,ENST00000545850
A3GALT2,ENST00000442999
A4GALT,ENST00000249005,ENST00000381278

Here is my code in python, that is leaving the file as original XD:

import sys

with open('gene_list.csv', 'r') as file_open:
    iterfile = iter(file_open)
    for line in iterfile:
        l = line.split(",")
        select = l[0]
        linext = iterfile.next()
        linext2 = linext.split(",")
        if select == linext2[0]:
            sys.stdout.write(select + ',' + linext2[1])
            next(file_open)
        else:
            sys.stdout.write(select + ',' + l[1])

I know that it is very easy to do but I am stuck with this. I really apreciate your help. Thanks!

JaimeG
  • 35
  • 5
  • Could you not use a dictionary with the key being the value from the first column, and the value being a list? – Ben Jan 25 '17 at 08:00
  • Hi @Ben, that's a good idea but I have thousands of values. Maybe I should do a list of all values and iterate over them? Thanks! – JaimeG Jan 25 '17 at 08:10

3 Answers3

2

Hope this helps :)

import csv
import collections

#Read in the data as a dictionary
with open('gene_list.csv', 'r') as fd:

    reader = csv.reader(fd)

    #If you have headers in the CSV file you want to skip
    #next(reader, None)

    #This dict will have key:value, value=list type
    unique_first_col = collections.defaultdict(list)
    for row in reader:
        unique_first_col[row[0]].append(row[1])

with open('output.csv', 'w') as fd:

    #Sorted dictionary
    sorted_d = collections.OrderedDict(sorted(unique_first_col.items()))
    for k, v in sorted_d.items():
        fd.write("%s, %s\n" % (k, ','.join(v)))

Notes:

  • see this question for collections.defaultdict
  • See the documentation for information on CSV processing
  • You may want to consider using simple string 'pre-processing' such as strip() before you key into the dictionary, as a trailing space may cause a key to be entered as a new key.
  • See this question for sorting dictionaries
Jamie Phan
  • 1,112
  • 7
  • 15
0

If you need to try pandas you can do this way:-

import pandas as pd
df = pd.read_csv("gene_list.csv", header=None)
df.columns = ["First", "Second"]
df.groupby("First")["Second"].agg({"Second":lambda x:", ".join(x.astype(str))})
Rakesh Kumar
  • 4,319
  • 2
  • 17
  • 30
0

The simple solution is to use the first value as a dictionary key. It is not strictly necessary to use defaultdict, but it makes constructing your list of secondary values easier.

from collections import defaultdict

merged = defaultdict(list)

with open('gene_list.csv', 'r') as f:
    for raw_line in f:
        line = raw_line.strip()
        first, second = line.split(',')
        merged[first].append(second)

for key, values in merged:
    print(key + ',' + ','.join(values))

If you have to assume that your original file can have more than two key, value pairs then you need to tweak this script a little.

instant
  • 676
  • 6
  • 12