1

I am looking for a way to find the ranges in a SQL statement.

My data looks like this

         COUNTRY PROTEIN  MG
1          China    42.8  II
2          China    42.3 III
3          China    41.9 III
4  United States    40.0  IV
5          China    43.2   I
6          China    42.5  IV
7          China    42.9 III
8          China    45.9  VI
9          Japan    42.3  VI
10 United States    40.9 III

I am currently running a statement like this where I get the average for the protein values but I do not know how to go about the range of the roman numerals

select COUNTRY, avg(PROTEIN), MG from data group by COUNTRY

I would like my output to look like this (NOT CORRECT NUMBERS JUST AN EXAMPLE)

COUNTRY MEAN MG_RANGE
China 42.3 I-III
United States 45.2 I-VI

I have this file as a CSV so I am open to alternatives in R and Python that would give me the same output. It could be numerical as well whatever output is easiest.

user2524994
  • 357
  • 1
  • 5
  • 13

2 Answers2

2

Here is some code that will both parse and generate Roman numerals: http://pyparsing.wikispaces.com/file/view/romanNumerals.py/30112817/romanNumerals.py

Convert your input Roman numerals to integers (using romanNumeral.parseString from the linked code), compute the min and max values to get the range, then use makeRomanNumeral in the linked code to convert the range values back to Roman nums.

Or if you don't have that many different Roman values to deal with, just define a dict that maps the Roman numerals to their actual values, and you can skip using the parser. Something like this:

romanLookup = {
  'I' : 1,
  'II' : 2,
  'III' : 3,
  'IV' : 4,
  'V' : 5,
  'VI' : 6,
  'VII' : 7,
  'VIII' : 8,
  'IX' : 9,
  'X' : 10,
  'XI' : 11,
  'XII' : 12,
  'XIII' : 13,
  'XIV' : 14,
  'XV' : 15,
  'XVI' : 16,
  'XVII' : 17,
  'XVIII' : 18,
  'XIX' : 19,
  'XX' : 20,
}

Here is your program:

from itertools import groupby
from collections import namedtuple
DataRec = namedtuple("DataRec", "country protein mg")

#123456789012345678901234567890
# X             X       X   X
data = """\
1          China    42.8  II
2          China    42.3 III
3          China    41.9 III
4  United States    40.0  IV
5          China    43.2   I
6          China    42.5  IV
7          China    42.9 III
8          China    45.9  VI
9          Japan    42.3  VI
10 United States    40.9 III""".splitlines()

suppress=object()
def splitAt(cols,fns):
    last = 0
    slices = []
    for c in cols:
        slices.append(slice(last,c))
        last = c+1
    return lambda s: [fn(s[sl]) for sl,fn in zip(slices,fns) 
                        if fn is not suppress]
splitter = splitAt([2,16,24,28], 
                    [suppress, str.strip, float, str.strip])

recs = [DataRec(*splitter(d)) for d in data]

romanLookup = {
  'I' : 1,
  'II' : 2,
  'III' : 3,
  'IV' : 4,
  'V' : 5,
  'VI' : 6,
  'VII' : 7,
  'VIII' : 8,
  'IX' : 9,
  'X' : 10,
  'XI' : 11,
  'XII' : 12,
  'XIII' : 13,
  'XIV' : 14,
  'XV' : 15,
  'XVI' : 16,
  'XVII' : 17,
  'XVIII' : 18,
  'XIX' : 19,
  'XX' : 20,
}

# sort and group data by country
recs.sort(key=lambda x: x.country)
grouped = groupby(recs, key=lambda x: x.country)

# for each country group, compute average protein and min/max mg
for country,countryRecs in grouped:
    datatuples = list(countryRecs)
    mg_vals = [r.mg for r in datatuples]
    ave = sum(r.protein for r in datatuples)/len(datatuples)
    min_mg = min(mg_vals, key=romanLookup.__getitem__)
    max_mg = max(mg_vals, key=romanLookup.__getitem__)
    if min_mg == max_mg:
        print "%s, %.2f, %s" % (country, ave, min_mg)
    else:
        print "%s, %.2f, %s-%s" % (country, ave, min_mg, max_mg)

Prints:

China, 43.07, I-VI
Japan, 42.30, VI
United States, 40.45, III-IV
PaulMcG
  • 62,419
  • 16
  • 94
  • 130
  • 1
    R also has the ability to go back and forth between Roman numerals loaded by default in the `utils` package. http://stackoverflow.com/questions/21116763/convert-roman-numerals-to-numbers-in-r – thelatemail Oct 01 '14 at 00:23
  • Can you show me some code to convert and find the ranges though? – user2524994 Oct 01 '14 at 00:36
  • Actually, you don't even need to convert the ranges, just use `romanLookup.__getitem__` as the key function for getting the min and max MG values. – PaulMcG Oct 01 '14 at 01:36
  • Ok But how can I get the ranges from those values? – user2524994 Oct 01 '14 at 01:50
  • Note that reporting the min and max values as, for example with China, "I-IV" implies that *all* of the values from I to IV are present; but there is no guarantee that, just because I and IV exist that II and III will also exist. – PaulMcG Oct 02 '14 at 16:46
  • Pyparsing is no longer hosted on wikispaces.com. Go to https://github.com/pyparsing/pyparsing – PaulMcG Aug 27 '18 at 13:09
2

In R, the utils package has a non-exported function .roman2numeric(), for converting character Roman numerals to their respective numeric values. We can convert from Roman to numeric then aggregate to find the range within a country.

(dat$MG2 <- utils:::.roman2numeric(as.character(dat$MG))
#  2   3   3   4   1   4   3   6   6   3 
dat
#         COUNTRY PROTEIN  MG MG2
# 1         China    42.8  II   2
# 2         China    42.3 III   3
# 3         China    41.9 III   3
# 4  UnitedStates    40.0  IV   4
# 5         China    43.2   I   1
# 6         China    42.5  IV   4
# 7         China    42.9 III   3
# 8         China    45.9  VI   6
# 9         Japan    42.3  VI   6
# 10 UnitedStates    40.9 III   3

And to (arbitrarily) find the range of MG2 per country, we can do

aggregate(MG2 ~ COUNTRY, dat, range)
#        COUNTRY MG2.1 MG2.2
# 1        China     1     6
# 2        Japan     6     6
# 3 UnitedStates     3     4
Rich Scriven
  • 97,041
  • 11
  • 181
  • 245