0

I want to sort my tab-delimited data file, which contains 15 columns, by its first column (column[0]). Only column 0 is shown below.

Input file (left) and desired output file (right):

contig1               contig1
contig102             contig1
contig405             contig2
contig1               contig17
contig2               contig102
contig1005            contig405
contig17              contig1005

The script below sorts, but it compares the names as strings: since "1" < "2", it gives me every contig whose number starts with 1 before it moves on to 2, so contig102 comes before contig2. How can I improve it?

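# Read and sort every line before the file is reopened for writing
with open('file.txt', 'r') as f1:
    a = sorted(f1.readlines(), key=lambda l: l.split()[0])
# Overwrite the same file with the sorted lines
with open('file.txt', 'w') as r:
    r.writelines(a)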
user3224522
  • 4
    Assuming they all start with contig, you could just sort by int(s[6:]) – Antimony Apr 03 '14 at 14:02
  • If that's a possibility, ideally you should pad the numbers with leading `0` digits. Another possibility, if you always have string + number order, would be to split, order by string, then by number, but that's a bit more tricky... And if the string is always the same, Antimony's suggestion is definitely the right one. – Laurent S. Apr 03 '14 at 14:02
  • 1
    possible duplicate of [Does Python have a built in function for string natural sort?](http://stackoverflow.com/questions/4836710/does-python-have-a-built-in-function-for-string-natural-sort) – RedX Apr 03 '14 at 14:08
  • It could be a duplicate, but I am specifically asking how to improve the script I wrote above. – user3224522 Apr 03 '14 at 14:15
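
The "split, order by string, then by number" idea from the comments can be sketched roughly as follows. This assumes every name is a non-digit prefix followed by a single run of digits; `split_name` is a hypothetical helper, not something from the question.

```python
import re

def split_name(name):
    # Hypothetical helper: "contig102" -> ("contig", 102).
    # Assumes a non-digit prefix followed by digits only.
    match = re.fullmatch(r'(\D*)(\d+)', name)
    if match is None:
        raise ValueError('unexpected name: %r' % name)
    return match.group(1), int(match.group(2))

# Column 0 of the input file from the question
names = ['contig1', 'contig102', 'contig405', 'contig1',
         'contig2', 'contig1005', 'contig17']

# Tuples compare element by element: prefix first, then the integer
ordered = sorted(names, key=split_name)
print(ordered)
```

Sorting on a `(prefix, number)` tuple keeps names with different prefixes grouped together while ordering each group numerically.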

2 Answers

2

How about this one:

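import re

def alphanumsort(x):
    # Split on digit runs; the capturing group keeps them in the result
    reg = re.compile(r'(\d+)')
    splitted = reg.split(x)
    # Compare digit runs as integers, everything else as text
    return [int(y) if y.isdigit() else y for y in splitted]

# Parenthesized print works on both Python 2 and 3
print(sorted(["contig1", "contig20", "bart30", "bart03"], key=alphanumsort))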
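
To use the same idea on the tab-delimited lines from the question, the key can be computed from the first field of each line only. A minimal sketch, assuming the lines are already in memory; `natural_key` and the sample `lines` are made up for illustration.

```python
import re

def natural_key(text):
    # Same idea as alphanumsort: digit runs become ints, the rest stays text
    return [int(part) if part.isdigit() else part
            for part in re.split(r'(\d+)', text)]

# Made-up sample lines with two tab-separated columns
lines = ['contig102\tA\n', 'contig2\tB\n', 'contig17\tC\n', 'contig1\tD\n']

# Sort whole lines, but compare only column 0
result = sorted(lines, key=lambda line: natural_key(line.split('\t')[0]))
print(''.join(result), end='')
```

The other columns travel with their line, since only the key looks at column 0.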
dorvak
1

If

l.split()[0]

gives

contig1
contig102

You want to sort on

int(l.split()[0][6:])

which is

1
102

Do

a = sorted(f1, key=lambda l: int(l.split()[0][6:]))
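
Combined with the reading and writing from the question, the whole thing might look like the sketch below. `sort_contig_file` is a hypothetical name, and the code assumes every line starts with `contig` followed by an integer; any other line would raise an error.

```python
def sort_contig_file(path):
    # Read and sort first, so the file is closed before it is rewritten
    with open(path) as src:
        # l.split()[0] is column 0; [6:] drops the 6-character "contig" prefix
        ordered = sorted(src, key=lambda l: int(l.split()[0][6:]))
    # Overwrite the file with the lines in numeric order
    with open(path, 'w') as dst:
        dst.writelines(ordered)
    return ordered
```

Calling `sort_contig_file('file.txt')` would then rewrite the file in place.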
  • There is no need to call `readlines()`, just pass in the file: `sorted(f1, key=...)`. – Bakuriu Apr 03 '14 at 14:08
  • @Bakuriu Right, I've changed that. –  Apr 03 '14 at 14:15
  • @dorvak Yes. But does it change? Not in the question. Let's assume the question makes sense as it is asked :) –  Apr 03 '14 at 14:16
  • Yep, that's right! (Your solution might be faster than mine (I didn't test this), but it's less generic.) – dorvak Apr 03 '14 at 14:18