2

I'm surprised that I can't find anything about ranking numbers in python...

Basically, I'm in need of two scripts to do the same task just one in an ascending order and one in a descending order.

row[2] is the numbers to be ranked, row[4] is the cell to put the rank into.

row[0] + row[1] is what defines each dataset/group

In this first example the larger numbers have the higher rank.

CSV Example 1 (Ranks Down)

uniquedata1,uniquecell1,42,data,1,data
uniquedata1,uniquecell1,32,data,2,data
uniquedata1,uniquecell1,13,data,3,data
uniquedata2,uniquecell2,41,data,2,data
uniquedata2,uniquecell2,39,data,3,data
uniquedata2,uniquecell2,45,data,1,data
uniquedata2,uniquecell2,22,data,4,data
uniquedata1,uniquecell2,36,data,3,data
uniquedata1,uniquecell2,66,data,1,data
uniquedata1,uniquecell2,40,data,2,data

In this second example the larger numbers have the lower rank.

CSV Example 2 (Ranks Up)

uniquedata1,uniquecell1,42,data,3,data
uniquedata1,uniquecell1,32,data,2,data
uniquedata1,uniquecell1,13,data,1,data
uniquedata2,uniquecell2,41,data,3,data
uniquedata2,uniquecell2,39,data,2,data
uniquedata2,uniquecell2,45,data,4,data
uniquedata2,uniquecell2,22,data,1,data
uniquedata1,uniquecell2,36,data,1,data
uniquedata1,uniquecell2,66,data,3,data
uniquedata1,uniquecell2,40,data,2,data

In this third example, which is ranking up it includes empty cells which should be given the highest ranks (if there are two blanks they should be given the same rank)

CSV Example 3 (Includes empty cells)

uniquedata1,uniquecell1,42,data,2,data
uniquedata1,uniquecell1,,data,3,data
uniquedata1,uniquecell1,13,data,1,data
uniquedata2,uniquecell2,41,data,3,data
uniquedata2,uniquecell2,,data,3,data
uniquedata2,uniquecell2,,data,3,data
uniquedata2,uniquecell2,22,data,1,data
uniquedata1,uniquecell2,36,data,1,data
uniquedata1,uniquecell2,66,data,3,data
uniquedata1,uniquecell2,40,data,2,data

Anyone know how to I can achieve my intended result?

Ryflex
  • 5,559
  • 25
  • 79
  • 148
  • possible duplicate of [sort csv by column](http://stackoverflow.com/questions/2100353/sort-csv-by-column) – YXD Dec 19 '13 at 13:07
  • You need to dream up alternative words that mean similar things and search for them too - e.g. `sort` rather than rank. – doctorlove Dec 19 '13 at 13:10
  • @MrE the problem is I'm unsure how to do the grouping part and I also don't know how to make the exceptions for the spaces. – Ryflex Dec 19 '13 at 13:11
  • If you used the word *sorting* instead of ranking, you'd have found a wealth of information, including the [Python sorting howto](https://wiki.python.org/moin/HowTo/Sorting). Sorting on numbers is trivial, including sorting in reverse. Sorting with blanks (`None`) puts `None` before any numbers, so the **normal sort order** already does *exactly* what you wanted. What you haven't shown is *any* effort to resolve this yourself yet. – Martijn Pieters Dec 21 '13 at 14:22
  • I don't understand your examples. You say the first one is ranked down and that row[2] has the numbers to be ranked. If that were true wouldn't the rows of data in the list be ordered by their third column (or perhaps by their supposed rank in rows[4])? I'd also like to know how you're going to award a bounty of +50 points if your own reputation score isn't even that high. – martineau Dec 21 '13 at 15:14
  • @martineau `row[2]` is the column of data what needs to be ranked and `row[4]` is the rank output. The "datasets" or "groups" is defined by joining `row[0]` and `row[1]` together which in this example gives us 3 unique "classes" to rank individually. My reputation was 87 earlier, I set a bounty and now it's 37. I can still award the bounty. – Ryflex Dec 22 '13 at 00:55
  • @MartijnPieters I've had a look around and at first I was unable to find anything until someone mentioned in python it's called "sorting" which from what I'm used to has a completely different meaning and after that I have no idea how to use the collections library correctly, nor do I know how to rank (sort) in my situation. – Ryflex Dec 22 '13 at 00:57
  • OK. In your examples, `row[4]` already has a value (rank) in it -- so are the rows of data shown the input or desired output? If it's the input, then the value should be ignored and re-computed based on `row[2]` and "group", right? – martineau Dec 22 '13 at 02:13
  • To put things more succinctly: What exactly is the input? Are the examples shown the intended results (which are csv files)? – martineau Dec 22 '13 at 02:49
  • @martineau yes, `row[4]` already has the desired rank/value, everything in there needs to be overwritten by a new calculation because the group size could get new entries or less entries. – Ryflex Dec 22 '13 at 08:43
  • @Hyflex if there're four number `10,20,20,21`, then what should their ranks be? `1,2,3,4`, `1,2,2,4` or `1,2,2,3`? – Timothy Dec 22 '13 at 12:12
  • @Hyflex and should the line order of the output has to be as the same as the input? – Timothy Dec 22 '13 at 12:18
  • @Skyler Based on the numbers you gave me, it would return ranks of `1,2,2,4` as for the line order, yes they should stay in the same order. – Ryflex Dec 22 '13 at 14:16

4 Answers4

4

This is pretty easy if you use pandas.

import pandas as pd

def sorted_df(df, ascending=False):
    grouped = df.groupby([0,1])
    data = []
    for g in grouped:
        d = g[1]
        d[4] = d[2].rank(ascending=ascending)
        d = d.sort(4)
        data.append(d)
    return pd.concat(data)

# load our dataframe from a csv string
import StringIO
f = StringIO.StringIO("""uniquedata1,uniquecell1,42,data,1,data
uniquedata1,uniquecell1,32,data,2,data
uniquedata1,uniquecell1,13,data,3,data
uniquedata2,uniquecell2,41,data,2,data
uniquedata2,uniquecell2,39,data,3,data
uniquedata2,uniquecell2,45,data,1,data
uniquedata2,uniquecell2,22,data,4,data
uniquedata1,uniquecell2,36,data,3,data
uniquedata1,uniquecell2,66,data,1,data
uniquedata1,uniquecell2,40,data,2,data""")

df = pd.read_csv(f, header=None)
# sort descending
sorted_df(df)
=>           0            1   2     3  4     5
0  uniquedata1  uniquecell1  42  data  1  data
1  uniquedata1  uniquecell1  32  data  2  data
2  uniquedata1  uniquecell1  13  data  3  data
8  uniquedata1  uniquecell2  66  data  1  data
9  uniquedata1  uniquecell2  40  data  2  data
7  uniquedata1  uniquecell2  36  data  3  data
5  uniquedata2  uniquecell2  45  data  1  data
3  uniquedata2  uniquecell2  41  data  2  data
4  uniquedata2  uniquecell2  39  data  3  data
6  uniquedata2  uniquecell2  22  data  4  data
# sort ascending
sorted_df(df, ascending=True)
=>           0            1   2     3  4     5
2  uniquedata1  uniquecell1  13  data  1  data
1  uniquedata1  uniquecell1  32  data  2  data
0  uniquedata1  uniquecell1  42  data  3  data
7  uniquedata1  uniquecell2  36  data  1  data
9  uniquedata1  uniquecell2  40  data  2  data
8  uniquedata1  uniquecell2  66  data  3  data
6  uniquedata2  uniquecell2  22  data  1  data
4  uniquedata2  uniquecell2  39  data  2  data
3  uniquedata2  uniquecell2  41  data  3  data
5  uniquedata2  uniquecell2  45  data  4  data
# add some NA values
from numpy import nan
df.ix[1,2] = nan
df.ix[4,2] = nan
df.ix[5,2] = nan
# sort ascending
sorted_df(df, ascending=True)
=>           0            1   2     3   4     5
2  uniquedata1  uniquecell1  13  data   1  data
0  uniquedata1  uniquecell1  42  data   2  data
1  uniquedata1  uniquecell1 NaN  data NaN  data
7  uniquedata1  uniquecell2  36  data   1  data
9  uniquedata1  uniquecell2  40  data   2  data
8  uniquedata1  uniquecell2  66  data   3  data
6  uniquedata2  uniquecell2  22  data   1  data
3  uniquedata2  uniquecell2  41  data   2  data
4  uniquedata2  uniquecell2 NaN  data NaN  data
5  uniquedata2  uniquecell2 NaN  data NaN  data

I think the behavior I have shown here to handle NA values (ranking them as NA) is probably more appropriate than the behavior you have shown in your hypothetical example, but you can fill NA values with whatever you want within each group using fillna.

mattexx
  • 6,456
  • 3
  • 36
  • 47
  • I was working on a version after I found http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.rank.html but yours is much better than my attempt, I was making it pretty hacky. Surely I can set the NA cells with NaN (from numpy?) – Ryflex Dec 23 '13 at 06:51
  • Yes, that is exactly what I have done here. See the four lines starting with `from numpy import nan`. – mattexx Dec 23 '13 at 10:52
1
import sys

#Read the input file
input_data = [line.rstrip().split(",") for line in open("input.txt", 'r').readlines()]

#Put the value and index of each line into a dict,
#categorizing by the dataset/group name. 
#Each different dataset/group is a key of the dict,
#and each key's value is a list.
group_dict = {}
index = 0
for line in input_data:
    group_key = line[0]+","+line[1]
    if group_key not in group_dict.keys():
        group_dict[group_key] = []
    group_dict[group_key].append([index, line[2], None])
    index += 1

#Sort each list of the dict by the numbers.
#Make blank to be a very large number. 
for key in group_dict.keys():
    group_dict[key] = sorted(group_dict[key], key=lambda x: sys.maxint if x[1]=="" else int(x[1]))
    #####group_dict[key] = group_dict[key][::-1]
    ##### Uncomment the above line to sort in descending order  

#Check if there're multiple items with the same number, 
#If so, set them by the same rank.
    group_dict[key][0][2] = 1
    for i in range(1, len(group_dict[key])):
        group_dict[key][i][2] = (group_dict[key][i-1][2] if group_dict[key][i][1] == group_dict[key][i-1][1] else i+1)

#In order to keep the same line order with the input file, 
#get all the lists together into a new list, 
#and sort them by the line index (recorded when put them into the dict).
rank_list = []
for rank in group_dict.values():
    rank_list += rank
rank_list = sorted(rank_list, key=lambda x: x[0])
for rank in rank_list:
    input_data[rank[0]][4] = str(rank[2])

#Output the final list.
for line in input_data:
    print ",".join(line)

Test:

Input:

uniquedata1,uniquecell1,123,data,99,data
uniquedata1,uniquecell1,,data,99,data
uniquedata1,uniquecell1,111,data,99,data
uniquedata2,uniquecell2,456,data,99,data
uniquedata2,uniquecell2,,data,99,data
uniquedata2,uniquecell2,,data,99,data
uniquedata2,uniquecell2,789,data,99,data
uniquedata1,uniquecell2,386,data,99,data
uniquedata1,uniquecell2,512,data,99,data
uniquedata1,uniquecell2,486,data,99,data

Output:

uniquedata1,uniquecell1,123,data,2,data
uniquedata1,uniquecell1,,data,3,data
uniquedata1,uniquecell1,111,data,1,data
uniquedata2,uniquecell2,456,data,1,data
uniquedata2,uniquecell2,,data,3,data
uniquedata2,uniquecell2,,data,3,data
uniquedata2,uniquecell2,789,data,2,data
uniquedata1,uniquecell2,386,data,1,data
uniquedata1,uniquecell2,512,data,3,data
uniquedata1,uniquecell2,486,data,2,data  
Timothy
  • 4,467
  • 5
  • 28
  • 51
1

You really don't need two scripts for the task if the only difference is whether the ranking is to be done ascending or descending order -- just make it an argument to a function as shown. TheStrCount class is so trivial, it's probably wasn't worth the effort (but I've left it in).

import csv
from itertools import count, groupby
import sys

_MIN_INT, _MAX_INT = -sys.maxint-1, sys.maxint
RANK_DOWN, RANK_UP = False, True # larger numbers to get higher or lower rank

class StrCount(count):
    """ Like itertools.count iterator but supplies string values. """
    def next(self):
        return str(super(StrCount, self).next())

def rerank(filename, direction):
    with open(filename, 'rb') as inf:
        reader = csv.reader(inf)
        subst = _MIN_INT if direction else _MAX_INT  # subst value for empty cells
        for dataset, rows in groupby(reader, key=lambda row: row[:2]):
            ranking = StrCount(1)
            prev = last_rank = None
            for row in sorted(rows,
                              key=lambda row: int(row[2]) if row[2] else subst,
                              reverse=direction):
                row[4] = (ranking.next() if row[2] or not row[2] and prev != ''
                                         else last_rank)
                print ','.join(row)
                prev, last_rank  = row[2], row[4]

if __name__ == '__main__':
    print 'CSV example_1.csv (ranked down):'
    rerank('example_1.csv', RANK_DOWN)
    print '\nCSV example_2.csv (ranked up):'
    rerank('example_2.csv', RANK_UP)
    print '\nCSV example_3.csv (ranked up):'
    rerank('example_3.csv', RANK_UP)

The output:

CSV example_1.csv (ranked down):
uniquedata1,uniquecell1,13,data,1,data
uniquedata1,uniquecell1,32,data,2,data
uniquedata1,uniquecell1,42,data,3,data
uniquedata2,uniquecell2,22,data,1,data
uniquedata2,uniquecell2,39,data,2,data
uniquedata2,uniquecell2,41,data,3,data
uniquedata2,uniquecell2,45,data,4,data
uniquedata1,uniquecell2,36,data,1,data
uniquedata1,uniquecell2,40,data,2,data
uniquedata1,uniquecell2,66,data,3,data

CSV example_2.csv (ranked up):
uniquedata1,uniquecell1,42,data,1,data
uniquedata1,uniquecell1,32,data,2,data
uniquedata1,uniquecell1,13,data,3,data
uniquedata2,uniquecell2,45,data,1,data
uniquedata2,uniquecell2,41,data,2,data
uniquedata2,uniquecell2,39,data,3,data
uniquedata2,uniquecell2,22,data,4,data
uniquedata1,uniquecell2,66,data,1,data
uniquedata1,uniquecell2,40,data,2,data
uniquedata1,uniquecell2,36,data,3,data

CSV example_3.csv (ranked up):
uniquedata1,uniquecell1,42,data,1,data
uniquedata1,uniquecell1,13,data,2,data
uniquedata1,uniquecell1,,data,3,data
uniquedata2,uniquecell2,41,data,1,data
uniquedata2,uniquecell2,22,data,2,data
uniquedata2,uniquecell2,,data,3,data
uniquedata2,uniquecell2,,data,3,data
uniquedata1,uniquecell2,66,data,1,data
uniquedata1,uniquecell2,40,data,2,data
uniquedata1,uniquecell2,36,data,3,data
martineau
  • 119,623
  • 25
  • 170
  • 301
0

Python programmers typically use a list to sort data. There are several hurdles to writing your own.

  • Memory Constraints
  • Speed
  • Reading the file and Writing the new file
  • Applying multiple sort operations in the right order

Alternatively you can store the data in an sqlite database (simple file based database) and use a SQL query to pull out the data using sqlite3. This is arguably much easier for some folks, and may even be preferred in some cases.

Show us how have you tried to achieve your result and maybe we can help further.

Derek Litz
  • 10,529
  • 7
  • 43
  • 53
  • Memory constraints and speed aren't an issue as the files are only 5mb in size. I don't understand how the grouping works when it comes to the collections library, nor can I work out how to use the sorting function to rank blanks... – Ryflex Dec 19 '13 at 14:57
  • 1
    We want to see your code :), people here like fixing code that is gets close and shows effort. If you can't provide code you should read the docs on sorting lists. You can provide a function to replace default behavior, just like many other languages allow you to do (through sub classing or other means). – Derek Litz Dec 19 '13 at 15:08