I have an input file,

10N06_64  sc635516  93.93   100.0
10N06_64  sc711028  93.99   100.0
10N06_64  sc255425  93.46   95.8
10N06_64  sc115511  87.5    93.0
116F19_238  sc121016    91.30   12.1
116F19_238  sc1132492   90.94   6.1
116F19_238  sc513573    87.38   6.1
116F19_238  sc68511 75.93   10.5

I need to group the rows by the first column (line[0]) and, within each group, print the 3 rows with the highest values in line[3], using line[2] to break ties, so that my output file looks like this:

10N06_64  sc635516  93.93   100.0
10N06_64  sc711028  93.99   100.0
10N06_64  sc255425  93.46   95.8
116F19_238  sc121016    91.30   12.1
116F19_238  sc68511 75.93   10.5
116F19_238  sc1132492   90.94   6.1

This is my attempt, but it prints only the single best line per group. How can I modify it to print the 3 best hits?

import csv
from itertools import groupby
from operator import itemgetter
with open('myfile','rb') as f1:
    with open('outfile', 'wb') as f2:
        reader = csv.reader(f1, delimiter='\t')
        writer1 = csv.writer(f2, delimiter='\t')
        for group, rows in groupby(reader, itemgetter(0)):
            best = max(rows, key=lambda r: (float(r[3]), float(r[2])))
            writer1.writerow(best)
user3224522

4 Answers

You could use heapq.nlargest() to get the lines with highest values:

#!/usr/bin/env python
import csv
import sys
from heapq import nlargest
from itertools import groupby

writerows = csv.writer(sys.stdout, delimiter='\t').writerows
for _, rows in groupby(csv.reader(sys.stdin, delimiter='\t'), key=lambda r: r[0]):
    writerows(nlargest(3, rows, key=lambda row: (float(row[3]), float(row[2]))))

Example:

$ <input.csv ./your-script >output.csv

Output

10N06_64    sc711028    93.99   100.0
10N06_64    sc635516    93.93   100.0
10N06_64    sc255425    93.46   95.8
116F19_238  sc121016    91.30   12.1
116F19_238  sc68511 75.93   10.5
116F19_238  sc1132492   90.94   6.1

nlargest() avoids loading each whole group into memory, since it keeps at most n rows at a time. If the number of rows is always small, you could also use sorted(iterable, key=key, reverse=True)[:n].
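As a small illustration of that equivalence, here is a sketch that applies both approaches to the first group from the question's input. The `score` helper and the in-memory `rows` list are assumptions made for the example, not part of the script above.

```python
from heapq import nlargest

# Rows of one group, copied from the question's input.
rows = [
    ['10N06_64', 'sc635516', '93.93', '100.0'],
    ['10N06_64', 'sc711028', '93.99', '100.0'],
    ['10N06_64', 'sc255425', '93.46', '95.8'],
    ['10N06_64', 'sc115511', '87.5', '93.0'],
]

def score(row):
    # Hypothetical helper: rank by column 3 first, then by column 2.
    return (float(row[3]), float(row[2]))

# Keeps at most 3 rows in its internal heap while scanning the input.
top_heap = nlargest(3, rows, key=score)

# Sorts the whole group, then slices off the first 3 rows.
top_sorted = sorted(rows, key=score, reverse=True)[:3]
```

Both `top_heap` and `top_sorted` should hold the same three rows in the same order; the difference is only in how much of the group is kept in memory at once.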

jfs

You can use the built-in sorted() function in your code:

Input:

10N06_64    sc635516    93.93   100.0
10N06_64    sc711028    93.99   100.0
10N06_64    sc255425    93.46   95.8
10N06_64    sc115511    87.5    93.0
116F19_238  sc121016    91.30   12.1
116F19_238  sc1132492   90.94   6.1
116F19_238  sc513573    87.38   6.1
116F19_238  sc68511 75.93   10.5

Code:

import csv
from itertools import groupby
from operator import itemgetter
with open('word.txt', 'rb') as f1:
    reader = csv.reader(f1, delimiter='\t')
    for group, rows in groupby(reader, itemgetter(0)):
        best = sorted(rows, key=lambda r: (float(r[3]), float(r[2])), reverse=True)[:3]
        for a in best:
            print a
        print "\n"

Output:

['10N06_64', 'sc711028', '93.99', '100.0']
['10N06_64', 'sc635516', '93.93', '100.0']
['10N06_64', 'sc255425', '93.46', '95.8']


['116F19_238', 'sc121016', '91.30', '12.1']
['116F19_238', 'sc68511', '75.93', '10.5']
['116F19_238', 'sc1132492', '90.94', '6.1']
The6thSense

You can try this:

import csv
from itertools import groupby
from operator import itemgetter

take = 3

with open('myfile','rb') as f1:
    with open('outfile', 'wb') as f2:
        reader = csv.reader(f1, delimiter='\t')
        writer1 = csv.writer(f2, delimiter='\t')
        for group, rows in groupby(reader, itemgetter(0)):
            sorted_items = sorted(rows, key=lambda r: (float(r[3]), float(r[2])), reverse=True)
            for item in sorted_items[:take]:
                writer1.writerow(item)

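The sorted() function accepts the same key argument as max(), but instead of returning only the single best item it returns all items ordered by that key, so slicing the result with [:take] gives the top rows of each group.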
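One caveat worth keeping in mind: itertools.groupby() only groups consecutive rows that share a key, so the code above relies on the input already being grouped by the first column, as it is in the question. If that might not hold, a possible sketch is to sort by the first column before grouping. The `top_n_per_group` helper and the shuffled sample rows below are hypothetical, introduced only for this example.

```python
from itertools import groupby
from operator import itemgetter

def top_n_per_group(rows, n=3):
    """Return the n best rows for each first-column value (hypothetical helper)."""
    def score(row):
        # Rank by column 3 first, then by column 2.
        return (float(row[3]), float(row[2]))

    result = []
    # Sort by the group key first so that groupby() sees each key only once.
    ordered = sorted(rows, key=itemgetter(0))
    for _, group in groupby(ordered, key=itemgetter(0)):
        result.extend(sorted(group, key=score, reverse=True)[:n])
    return result

# Rows taken from the question's input, deliberately interleaved.
rows = [
    ['116F19_238', 'sc121016', '91.30', '12.1'],
    ['10N06_64', 'sc635516', '93.93', '100.0'],
    ['116F19_238', 'sc68511', '75.93', '10.5'],
    ['10N06_64', 'sc711028', '93.99', '100.0'],
]
best = top_n_per_group(rows, n=1)
```

Sorting the whole input costs memory, so this is only a reasonable trade-off when the file fits comfortably in RAM; for input that is already grouped, the streaming version above is enough.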

avenet

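# You need to use if statements to identify the 3 best hits, for example:

# table is assumed to be a list of numbers; start below any possible value
number1 = number2 = number3 = float('-inf')

for x in table:
    if x > number1:
        # new best value: shift the previous best and second best down
        number1, number2, number3 = x, number1, number2
    elif x > number2:
        # new second best: the previous second best becomes third
        number2, number3 = x, number2
    elif x > number3:
        number3 = x

print number1, number2, number3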