Sorting data by multiple columns

Question

I have a question about sorting data by multiple columns. I'm definitely a beginner at this and am wondering how I can sort by one column and then by another without losing the ordering of the first column. I have a file of tab separated data consisting of three columns. The majority of the data isn't paired (one id, first column, and position start and end, second and third columns). Occasionally, however, there are multiple entries for the same ID (first column). These need to remain grouped together (without a space separating them from the next entry, unless it has a different ID). The data is really already sorted with respect to the first column, but I need to sort it numerically based on the starting position (second column) while preserving the original sorting. Like this:

Current format:

PITG_00129  606 1436

PITG_00130  1   987

PITG_00132  2   1321

PITG_00133 4464 11708
PITG_00133 1 2946
PITG_00133 4081 4515

Desired format:

PITG_00129  606 1436

PITG_00130  1   987

PITG_00132  2   1321

PITG_00133 1 2946
PITG_00133 4081 4515
PITG_00133 4464 11708

You can use the `sort` command if you are using some kind of *nix environment. — GWW, Jan 15 '13 at 21:28

mgilson · Accepted Answer · 2013-01-15T22:07:05.303

You can do this pretty easily in python. First, you need to read your data in a proper format:

def line_to_tuple(line):
    data = line.split()
    return (data[0],int(data[1]),int(data[2]))

This will turn each line into a tuple which will sort lexicographically. Since your strings (the first column) are set up in an easily sorted manner, we don't need to worry about them. The second and third columns just need to be converted to integers to make them sort properly.

with open(inputfile) as fin, open(outputfile,'w') as fout:
    non_blank_lines = (line for line in fin if line.strip())
    sorted_lines = sorted(non_blank_lines,key=line_to_tuple)
    fout.writelines(sorted_lines)

Here's another implementation to preserve blank lines between fields:

import itertools
def field1(line):
    data = line.split()
    try:
        return data[0]
    except IndexError:
        return None

def fields(line):
    data = line.split()
    return data[0],int(data[1]),int(data[2])

with open('test.dat') as fin, open('output.dat','w') as fout:
    for k,v in itertools.groupby(fin,key=field1):
        if k is None:
            fout.write('\n')
        else:
            fout.writelines(sorted(v,key=fields))

This uses itertools to chunk up the file based on the empty lines and sorts those groups individually before writing them back out.

Here's the output:

temp $ cat output.dat 
PITG_00129  606 1436

PITG_00130  1   987

PITG_00132  2   1321

PITG_00133 1 2946
PITG_00133 4081 4515
PITG_00133 4464 11708

It seems to me they wanted to retain blank lines between different ID groups. — georg, Jan 15 '13 at 21:49
Yeah, maybe? That could be accomplished with a simple `itertools.groupby` — mgilson, Jan 15 '13 at 21:52
@thg435 -- Added a solution that preserves those blank lines between ID groups. — mgilson, Jan 15 '13 at 22:11
Thanks for the detailed reply. Amazingly, I feel like I actually understand what you did haha. Ran it and it works out great. Thanks again. — user1784467, Jan 15 '13 at 22:24

score 3 · Answer 2 · answered Jan 15 '13 at 22:48

This is pretty trivial in perl, and always with perl, TIMTOWDI (there is more than 1 way to do it.) The two example below produce equivalent results.

Option 1 - Using the or comparison operator in your sort. This example has an array of objects. It will first sort by last_name, if they are equal it will check the first_name. This is good if you aren't going to be resorting your data over and over again by different variables.

my @records = (...);

my @sorted = sort { $a->last_name cmp $b->last_name ||
                    $a->first_name cmp $b->first_name } @records;

Option 2 - Using the sort pragma. Perl's sort isn't stable by default, but you can make it stable with the sort pragma. This is the better option if you will be resorting the same data multiple times by different variables and want to keep the order from the last sort intact.

use sort 'stable';

@records = sort { $a->first_name cmp $b->first_name } @records;

@records = sort { $a->last_name cmp $b->last_name } @records;

`sort` is stable since 5.7, see the [doc](http://perldoc.perl.org/functions/sort.html). — amon, Jan 15 '13 at 23:11

score 2 · Answer 3 · edited May 23 '17 at 12:27

2

If you perform one sort after another, and the sorts are stable, the order of the first sort will be retained for equal keys of the second sort. I believe Python's sort function is stable, not sure about Perl.

edited May 23 '17 at 12:27

Community

1
1

answered Jan 15 '13 at 21:28

Mark Ransom

299,747
42
398
622

Colonel Panic · Answer 4 · 2013-01-16T12:12:22.057

1

To sort a list of comma-separated value by the first column, then the second, run a command such as:

csvfix sort -f 1,2 data.csv

http://csvfix.byethost5.com/csvfix15/csvfix.html

You'll first need to find-and-replace your tabs for commas.

edited Jan 16 '13 at 12:12

answered Jan 15 '13 at 21:31

Colonel Panic

132,665
89
401
465

score 1 · Answer 5 · answered Jan 15 '13 at 21:32

1

algorithm

 1.  Get the unique first column values
 2.  for each first column value grep the origiginal file and redirect to temp file
       you will get the partial output in temp file for each unique value.  you can use sort command to sort by 2nd column on the temp file and redirect the output (append) to output file.

answered Jan 15 '13 at 21:32

Satish

713
1
5
18

+1 for encouraging implementation of algorithm as opposed to just code – PinkElephantsOnParade Jan 15 '13 at 21:55

score 1 · Answer 6 · edited Jan 15 '13 at 23:38

Here's a Perl option that uses a Schwartzian transform to sort the clustered lines on the first numeric value. The transform tags the thing that is to be sorted with a key that can be sorted easily. In the example, the map tags all lines. The sort block then specifies how these tagged values are to be sorted: by their second element. The untagged value is then printed.

use strict;
use warnings;

local $/ = ''; #paragraph mode

while (<>) {
    print $_->[0], "\n"
      for sort { $a->[1] <=> $b->[1] }
          map  { [ $_, (split)[1] ] }
               split /\n/;

    print "\n";
}

Invocation: $ perl script.pl data.txt > data_sorted.txt

As a one-liner (perl 5.10 or later):

perl -nE'BEGIN{$/=""}say$$_[0]for sort{$$a[1]<=>$$b[1]}map[$_,(split)[1]],split/\n/;say""' data.txt >data_sorted.txt

data.txt:

PITG_00129  606 1436

PITG_00130  417   987
PITG_00130  1   987

PITG_00132  2   1321

PITG_00133 4464 11708
PITG_00133 1 2946
PITG_00133 4081 4515

PITG_00134 446 1170
PITG_00134 41 294
PITG_00134 17 451

data_sorted.txt:

PITG_00129  606 1436

PITG_00130  1   987
PITG_00130  417   987

PITG_00132  2   1321

PITG_00133 1 2946
PITG_00133 4081 4515
PITG_00133 4464 11708

PITG_00134 17 451
PITG_00134 41 294
PITG_00134 446 1170

Sorting data by multiple columns

6 Answers6