Grouping, sorting lines and removing redundancies

Question

I have an input file having 15 columns,

con13   tr|M0VCZ1|  91.39   267 23  0   131 211 1   267 1   480 239 267 33.4    99.6
con13   tr|M8B287|  97.12   590 17  0   344 211 1   267 0   104 239 590 74.0    99.8
con15   tr|M0WV77|  92.57   148 11  0   73  516 1   148 2   248 256 148 17.3    99.3
con15   tr|C5WNQ0|  85.14   148 22  0   73  516 1   178 4   233 256 148 17.3    99.3
con15   tr|B8AQC2|  83.78   148 24  0   73  516 1   148 6   233 256 148 17.3    99.3
con18   tr|G9HXG9|  99.66   293 1   0   144 102 1   293 7   527 139 301 63.1    97.0
con18   tr|M0XCZ0|  98.29   293 5   0   144 102 1   293 2   519 139 301 63.1    97.0

I need to 1) group and iterate inside each con (using groupby), 2) sort line[2] from lowest to highest value, 3) see inside each group if line[0], line[8] and line[9] are similar, 4) if they are similar, remove repetitive elements and print the results in a new .txt file choosing the one that has highest value in line[2], so that my output file looks like this,

con13   tr|M8B287|  97.12   590 17  0   344 211 1   267 0   104 239 590 74.0    99.8
con15   tr|M0WV77|  92.57   148 11  0   73  516 1   148 2   248 256 148 17.3    99.3
con15   tr|C5WNQ0|  85.14   148 22  0   73  516 1   178 4   233 256 148 17.3    99.3
con18   tr|G9HXG9|  99.66   293 1   0   144 102 1   293 7   527 139 301 63.1    97.0

My attempted script, prints only one single con and does not sort,

from itertools import groupby
f1 = open('example.txt','r')
f2 = open('result1', 'w')
f3 = open('result2.txt','w')
for k, g in groupby(f1, key=lambda x:x.split()[0]): 
    seen = set()
    for line in g:
        hsp = tuple(line.rsplit())
if hsp[8] and hsp[9] not in seen:
    seen.add(hsp)
    f2.write(line.rstrip() + '\n') 
else:
    f3.write(line.rstrip() + '\n')

Your `if hsp[8] and hsp[9] not in seen:` line runs into the classical [if x or y or z == blah](http://stackoverflow.com/q/15112125) problem. Not to mention that `seen` stores *whole lines*. — Martijn Pieters, Apr 15 '14 at 16:31
yeah, it is tab separated and my 1st column is already sorted, but column 3 I need to sort inside each group — user3224522, Apr 15 '14 at 16:34
Your 3rd column is a floating point value? My answer sorts these as if they are strings, so lexicographically. This works fine *as long as the values have the same number of digits*. So sorting `11.12` and `98.75` works, `8.4` and `78.42` does *not* work, in which case you'd have to explicitly turn that column to floats while sorting. Not hard to do, but bears making explicit. — Martijn Pieters, Apr 15 '14 at 16:50
yeah, they are floats, but they all have the same number of digits.. — user3224522, Apr 15 '14 at 17:03
Specification 4) says *results in a new .txt file choosing the one that has highest value* - do you only want a single line of a group in the new file? The one with the *largest* value in the 3rd column? — wwii, Apr 15 '14 at 17:04
in this particular example, If there are several lines inside each group having similar elements in line[0],line[8] and line[9] I need it to print me the one that has highest value in line[2]. In con15 in output instead there are two lines since elements in lines are not similar.. — user3224522, Apr 15 '14 at 17:18

Martijn Pieters · Accepted Answer · 2014-04-15T17:58:57.753

Use the csv module to pre-split your lines for you and write out formatted data again, and use a tuple in seen (of just the 9th and 10th columns) to track similar rows:

import csv
from itertools import groupby
from operator import itemgetter

with open('example.txt','rb') as f1
    with open('result1', 'wb') as f2, open('result2.txt','wb') as f3):
        reader = csv.reader(f1, delimiter='\t')
        writer1 = csv.writer(f2, delimiter='\t')
        writer2 = csv.writer(f3, delimiter='\t')

        for group, rows in groupby(reader, itemgetter(0)):
            rows = sorted(rows, key=itemgetter(8, 9, 2))
            for k, rows in groupby(rows, itemgetter(8, 9)):
                # now we are grouping on columns 8 and 9,
                # *and* these are sorted on column 2
                # everything but the *last* row is written to writer2
                rows = list(rows)
                writer1.writerow(rows[-1])
                writer2.writerows(rows[:-1])

The sorted(rows, key=itemgetter(2)) call sorts the grouped rows (so all rows with the same row[0] value) on the 3rd column.

Because you then want to write only the row with the highest value in column 2 *per group of rows with column 8 and 9 equal) to the first result file, we group again, but sorted on columns 8, 9 and 2 (in that order), then group on just columns 8 and 9 giving us sorted groups in ascending order for column 2. The last row is then written to result1, the rest to result2.txt.

Thank you, works! Just a question, what if I also need to get only one row that has the max value in column 3? — user3224522, Apr 15 '14 at 17:03
Store the `sorted()` result in a variable (`rows = sorted(rows, key=itemgetter(2))`), then you can get the row with the maximum value for that column with `max(rows, key=itemgetter(3))`; same caveats about strings apply; use `max(rows, key=lambda r: int(r[3]))` otherwise. — Martijn Pieters, Apr 15 '14 at 17:06
@user3224522: actually, I missed your 4th requirement there; updated the answer to handle that better. — Martijn Pieters, Apr 15 '14 at 17:59

Grouping, sorting lines and removing redundancies

1 Answers1