
This seems like a very basic question, but I am new to Python, and after spending a long time trying to find a solution on my own, I thought it was time to ask some more experienced people!

So, I have a file (sample):

ENSMUSG00000098737  95734911    95734973    3   miRNA
ENSMUSG00000077677  101186764   101186867   4   snRNA
ENSMUSG00000092727  68990574    68990678    11  miRNA
ENSMUSG00000088009  83405631    83405764    14  snoRNA
ENSMUSG00000028255  145003817   145032776   3   protein_coding
ENSMUSG00000028255  145003817   145032776   3   processed_transcript
ENSMUSG00000028255  145003817   145032776   3   processed_transcript
ENSMUSG00000098481  38086202    38086317    13  miRNA
ENSMUSG00000097075  126971720   126976098   7   lincRNA
ENSMUSG00000097075  126971720   126976098   7   lincRNA

and I need to write a new file with all the same information, but sorted by the first column.

What I have so far is:

lines = open(my_file, 'r').readlines()
output = open("intermediate_alphabetical_order.txt", 'w')

for line in sorted(lines, key=itemgetter(0)):
    output.write(line)

output.close()

It doesn't raise any error, but it writes the output file in exactly the same order as the input file.

I know it is certainly a very basic mistake, but it would be amazing if some of you could tell me what I'm doing wrong!

Thanks a lot!

Edit

I am having trouble with the way I read the file, so the answers about arrays that are already loaded in memory don't really help.

Tiana
  • Have you tried reading in by line and zipping? – m_callens Dec 08 '15 at 14:16
  • Hi, I think this might be answered in http://stackoverflow.com/questions/20099669/python-sort-multidimensional-array-based-on-2nd-element-of-subarray, http://stackoverflow.com/questions/20183069/how-to-sort-multidimensional-array-by-column, ... – bufh Dec 08 '15 at 14:17
  • @bufh Not quite, those explain how to do what the OP is already trying. – SuperBiasedMan Dec 08 '15 at 14:22
  • @bufh Yes, I saw these answers, but the part I was struggling with had to do with the way I read my file, so an answer that already starts from an array didn't help me. Thank you anyway :) – Tiana Dec 08 '15 at 14:29

6 Answers


The problem is that you're not turning each line into a list. When you read the file, each line is a single string, so itemgetter(0) returns the first character of that string. In your input that character is always 'E', so every line gets the same key, and because sorted is stable, lines with equal keys keep their original order.
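A quick way to check this is the minimal sketch below. It uses two lines copied from the sample in the question; the variable names are illustrative only.

```python
from operator import itemgetter

# Two lines from the sample, deliberately out of order.
lines = [
    "ENSMUSG00000098737  95734911    95734973    3   miRNA\n",
    "ENSMUSG00000077677  101186764   101186867   4   snRNA\n",
]

# itemgetter(0) applied to a string returns its first character.
first_chars = [itemgetter(0)(line) for line in lines]

# Every key is 'E' and sorted() is stable, so the order does not change.
unchanged = sorted(lines, key=itemgetter(0))
```

Here `first_chars` contains only the letter E, which is why the output file looked identical to the input.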

To just sort by the first column, you need to split the first block off and just read that section. So your key should be this:

for line in sorted(lines, key=lambda line: line.split()[0]):

split will turn your line into a list, and then the first column is taken from that list.
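Putting the pieces together, a complete read, sort, and write cycle might look like the sketch below. The function name `sort_file_by_first_column` and the temporary file names are hypothetical, and the sketch assumes the file has no blank lines (a blank line would make `line.split()[0]` raise an IndexError).

```python
import os
import tempfile


def sort_file_by_first_column(src_path, dst_path):
    """Copy src_path to dst_path with lines ordered by their first field."""
    with open(src_path) as src:
        lines = src.readlines()
    # split() with no argument splits on any run of whitespace (spaces or tabs).
    lines.sort(key=lambda line: line.split()[0])
    with open(dst_path, "w") as dst:
        dst.writelines(lines)


# Small demonstration on a temporary file holding three of the sample rows.
sample = (
    "ENSMUSG00000098737  95734911    95734973    3   miRNA\n"
    "ENSMUSG00000077677  101186764   101186867   4   snRNA\n"
    "ENSMUSG00000028255  145003817   145032776   3   protein_coding\n"
)
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "input.txt")
    dst_path = os.path.join(tmp, "sorted.txt")
    with open(src_path, "w") as f:
        f.write(sample)
    sort_file_by_first_column(src_path, dst_path)
    with open(dst_path) as f:
        result = f.read()
```

Using `with` blocks closes both files even if an error occurs partway through, which the explicit open/close pattern does not guarantee.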

SuperBiasedMan

If your input file is tab-separated, you can also use the csv module.

import csv
from operator import itemgetter

# Open the file in a with block so it is closed after reading.
with open("t.txt", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    for line in sorted(reader, key=itemgetter(0)):
        print(line)

This sorts by the first column.

Change the number in

key=itemgetter(0)

for sorting by a different column.
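itemgetter also accepts several indices at once, which gives a tie-breaker column. The sketch below assumes tab-separated input and uses an in-memory string in place of the real file; it also writes the sorted rows back out with csv.writer. The choice of columns 0 and 4 is only an illustration.

```python
import csv
import io
from operator import itemgetter

# Tab-separated stand-in for the input file (an assumption; the real
# file may be space-separated instead).
data = (
    "ENSMUSG00000028255\t145003817\t145032776\t3\tprotein_coding\n"
    "ENSMUSG00000098737\t95734911\t95734973\t3\tmiRNA\n"
    "ENSMUSG00000028255\t145003817\t145032776\t3\tprocessed_transcript\n"
)

reader = csv.reader(io.StringIO(data), delimiter="\t")
# itemgetter(0, 4) builds a (column 0, column 4) tuple, so ties on the
# first column are broken by the fifth.
rows = sorted(reader, key=itemgetter(0, 4))

# Write the sorted rows back out in the same tab-separated layout.
out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
writer.writerows(rows)
sorted_text = out.getvalue()
```

All fields are compared as strings here, so numeric columns would need converting in the key before they sort in numeric order.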

Revan

Same idea as SuperBiasedMan's answer, but I prefer this approach: if you want a different sort order (for example, if the first column matches, sort by the second, then the third, and so on), it is easier to implement.

with open(my_file) as f:
    lines = [line.split(' ') for line in f]

# Lists compare element by element, so ties on the first column
# fall through to the second, then the third, and so on.
with open("result.txt", 'w') as output:
    for line in sorted(lines):
        output.write(' '.join(line))
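One caveat with comparing whole rows: every field is still a string, so numeric columns are ordered as text. The sketch below shows the difference and one possible key that converts the numeric fields. The rows are hypothetical (the ID "GENE_A" is made up so that two rows share a first column).

```python
# Hypothetical rows: same first column, different start positions.
rows = [
    ["GENE_A", "95734911", "95734973"],
    ["GENE_A", "101186764", "101186867"],
]

# Plain list comparison treats every field as text, so "101186764"
# sorts before "95734911" because "1" < "9".
as_text = sorted(rows)

# Converting the numeric fields in the key compares them as numbers.
as_numbers = sorted(rows, key=lambda row: (row[0], int(row[1]), int(row[2])))
```

With the integer key, the row starting at 95734911 comes first, which is usually what is wanted for genomic coordinates.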
Soronbe

You can write a generator function that takes a filename, a column index, and a delimiter, and uses csv.reader to parse the file:

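import csv
from operator import itemgetter


def sort_by(fle, col, delim):
    # Yield the rows of the file ordered by the given column index.
    with open(fle, newline="") as f:
        # The csv.reader keyword is "delimiter", not "delim".
        r = csv.reader(f, delimiter=delim)
        for row in sorted(r, key=itemgetter(col)):
            yield row


# Sort by the third column (index 2); values are compared as strings.
for row in sort_by("your_file", 2, "\t"):
    print(row)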
Padraic Cunningham

You can do this quickly with pandas as follows, with the data file set up exactly as you show it (i.e., with variable spaces as separators):

import pandas as pd
df = pd.read_csv('csvdata.csv', sep=' ', skipinitialspace=True, header=None)
# DataFrame.sort was removed from pandas; sort_values is its replacement.
df.sort_values(by=[0], inplace=True)
df.to_csv('sorted_csvdata.csv', header=False, index=False)

Just to check the result:

with open('sorted_csvdata.csv', 'r') as f:
    print(f.read())

ENSMUSG00000028255,145003817,145032776,3,protein_coding
ENSMUSG00000028255,145003817,145032776,3,processed_transcript
ENSMUSG00000028255,145003817,145032776,3,processed_transcript
ENSMUSG00000077677,101186764,101186867,4,snRNA
ENSMUSG00000088009,83405631,83405764,14,snoRNA
ENSMUSG00000092727,68990574,68990678,11,miRNA
ENSMUSG00000097075,126971720,126976098,7,lincRNA
ENSMUSG00000097075,126971720,126976098,7,lincRNA
ENSMUSG00000098481,38086202,38086317,13,miRNA
ENSMUSG00000098737,95734911,95734973,3,miRNA

You can sort on multiple columns by adding more column labels to the list passed to the sort call (the [0] above).

Steve Misuta

Here is another option, similar to some of the ideas above. Basically, mysort is a key function that does the custom sorting for you, which here is based on the first whitespace-separated field of each line.

def mysort(line):
    return line.split()[0]

with open("records.txt", "r") as f:
    text = f.readlines()

for line in sorted(text, key=mysort):
    print(line.rstrip("\n"))