Find ranges of Roman Numerals in SQL

Question

I am looking for a way to find the ranges in a SQL statement.

My data looks like this

         COUNTRY PROTEIN  MG
1          China    42.8  II
2          China    42.3 III
3          China    41.9 III
4  United States    40.0  IV
5          China    43.2   I
6          China    42.5  IV
7          China    42.9 III
8          China    45.9  VI
9          Japan    42.3  VI
10 United States    40.9 III

I am currently running a statement like this where I get the average for the protein values but I do not know how to go about the range of the roman numerals

select COUNTRY, avg(PROTEIN), MG from data group by COUNTRY

I would like my output to look like this (NOT CORRECT NUMBERS JUST AN EXAMPLE)

COUNTRY MEAN MG_RANGE
China 42.3 I-III
United States 45.2 I-VI

I have this file as a CSV so I am open to alternatives in R and Python that would give me the same output. It could be numerical as well whatever output is easiest.

why no just convert roman numerals to standard (arabic) ones, then you will be able to use any math sql functions? — Iłya Bursov, Oct 01 '14 at 00:14
Note that you've left out the desired output. Please add it. Would you want 3-4 or III-IV or (two columns: 'MG_MIN' and 'MG_MAX') with 3 and 4, or III and IV? — Jonathan Leffler, Oct 01 '14 at 00:37

PaulMcG · Answer 1 · 2014-10-01T04:15:24.257

Here is some code that will both parse and generate Roman numerals: http://pyparsing.wikispaces.com/file/view/romanNumerals.py/30112817/romanNumerals.py

Convert your input Roman numerals to integers (using romanNumeral.parseString from the linked code), compute the min and max values to get the range, then use makeRomanNumeral in the linked code to convert the range values back to Roman nums.

Or if you don't have that many different Roman values to deal with, just define a dict that maps the Roman numerals to their actual values, and you can skip using the parser. Something like this:

romanLookup = {
  'I' : 1,
  'II' : 2,
  'III' : 3,
  'IV' : 4,
  'V' : 5,
  'VI' : 6,
  'VII' : 7,
  'VIII' : 8,
  'IX' : 9,
  'X' : 10,
  'XI' : 11,
  'XII' : 12,
  'XIII' : 13,
  'XIV' : 14,
  'XV' : 15,
  'XVI' : 16,
  'XVII' : 17,
  'XVIII' : 18,
  'XIX' : 19,
  'XX' : 20,
}

Here is your program:

from itertools import groupby
from collections import namedtuple
DataRec = namedtuple("DataRec", "country protein mg")

#123456789012345678901234567890
# X             X       X   X
data = """\
1          China    42.8  II
2          China    42.3 III
3          China    41.9 III
4  United States    40.0  IV
5          China    43.2   I
6          China    42.5  IV
7          China    42.9 III
8          China    45.9  VI
9          Japan    42.3  VI
10 United States    40.9 III""".splitlines()

suppress=object()
def splitAt(cols,fns):
    last = 0
    slices = []
    for c in cols:
        slices.append(slice(last,c))
        last = c+1
    return lambda s: [fn(s[sl]) for sl,fn in zip(slices,fns) 
                        if fn is not suppress]
splitter = splitAt([2,16,24,28], 
                    [suppress, str.strip, float, str.strip])

recs = [DataRec(*splitter(d)) for d in data]

romanLookup = {
  'I' : 1,
  'II' : 2,
  'III' : 3,
  'IV' : 4,
  'V' : 5,
  'VI' : 6,
  'VII' : 7,
  'VIII' : 8,
  'IX' : 9,
  'X' : 10,
  'XI' : 11,
  'XII' : 12,
  'XIII' : 13,
  'XIV' : 14,
  'XV' : 15,
  'XVI' : 16,
  'XVII' : 17,
  'XVIII' : 18,
  'XIX' : 19,
  'XX' : 20,
}

# sort and group data by country
recs.sort(key=lambda x: x.country)
grouped = groupby(recs, key=lambda x: x.country)

# for each country group, compute average protein and min/max mg
for country,countryRecs in grouped:
    datatuples = list(countryRecs)
    mg_vals = [r.mg for r in datatuples]
    ave = sum(r.protein for r in datatuples)/len(datatuples)
    min_mg = min(mg_vals, key=romanLookup.__getitem__)
    max_mg = max(mg_vals, key=romanLookup.__getitem__)
    if min_mg == max_mg:
        print "%s, %.2f, %s" % (country, ave, min_mg)
    else:
        print "%s, %.2f, %s-%s" % (country, ave, min_mg, max_mg)

Prints:

China, 43.07, I-VI
Japan, 42.30, VI
United States, 40.45, III-IV

R also has the ability to go back and forth between Roman numerals loaded by default in the `utils` package. http://stackoverflow.com/questions/21116763/convert-roman-numerals-to-numbers-in-r — thelatemail, Oct 01 '14 at 00:23
Can you show me some code to convert and find the ranges though? — user2524994, Oct 01 '14 at 00:36
Actually, you don't even need to convert the ranges, just use `romanLookup.__getitem__` as the key function for getting the min and max MG values. — PaulMcG, Oct 01 '14 at 01:36
Note that reporting the min and max values as, for example with China, "I-IV" implies that *all* of the values from I to IV are present; but there is no guarantee that, just because I and IV exist that II and III will also exist. — PaulMcG, Oct 02 '14 at 16:46
Pyparsing is no longer hosted on wikispaces.com. Go to https://github.com/pyparsing/pyparsing — PaulMcG, Aug 27 '18 at 13:09

Rich Scriven · Accepted Answer · 2016-10-24T18:53:37.273

In R, the utils package has a non-exported function .roman2numeric(), for converting character Roman numerals to their respective numeric values. We can convert from Roman to numeric then aggregate to find the range within a country.

(dat$MG2 <- utils:::.roman2numeric(as.character(dat$MG))
#  2   3   3   4   1   4   3   6   6   3 
dat
#         COUNTRY PROTEIN  MG MG2
# 1         China    42.8  II   2
# 2         China    42.3 III   3
# 3         China    41.9 III   3
# 4  UnitedStates    40.0  IV   4
# 5         China    43.2   I   1
# 6         China    42.5  IV   4
# 7         China    42.9 III   3
# 8         China    45.9  VI   6
# 9         Japan    42.3  VI   6
# 10 UnitedStates    40.9 III   3

And to (arbitrarily) find the range of MG2 per country, we can do

aggregate(MG2 ~ COUNTRY, dat, range)
#        COUNTRY MG2.1 MG2.2
# 1        China     1     6
# 2        Japan     6     6
# 3 UnitedStates     3     4

Find ranges of Roman Numerals in SQL

2 Answers2