0

Am trying to get the top most rating using groupby of multiple columns and if there is no combination of that particular groupby, its throwing me an error . how to do multiple combinations ?

data :

maritalstatus   gender age_range occ    rating
ma                 M    young   student  PG
ma                 F    adult   teacher  R
sin                M    young   student  PG
sin                M    adult   teacher  R
ma                 M    young   student  PG
sin                F    adult   teacher  R

code :

def get_top( maritalstatus, gender,age_range, occ):        
    m = df.groupby(['maritalstatus',' gender', 'age_range', 'occ'])
    ['rating'].apply(lambda x: x.value_counts().index[0 ])      
    mpaa = m[maritalstatus][gender][age_range][occ]    
    return mpaa

input :

get_top('ma', 'M', 'young','teacher)

output: throws me an error as there is no such combination.

Here if there is no such combination my function should limit to, married, male and young and not teacher as there is no such combination.

Aran-Fey
  • 39,665
  • 11
  • 104
  • 149
pylearner
  • 1,358
  • 2
  • 10
  • 26
  • Protip: Tag your pandas question with `pandas` if you want the right people to see it. – Aran-Fey Feb 10 '18 at 10:35
  • Thanks Aran for the tip. – pylearner Feb 10 '18 at 10:45
  • Do you need solution which should be working with `get_top('ma', 'M', 'teacher)` or `get_top('M', 'young','teacher)` or `get_top('ma', 'teacher)` ? – jezrael Feb 10 '18 at 18:39
  • first, it should look at a sequence, get_top('ma', 'm', 'young', 'teacher'), if its not there then get_top('ma', 'm', 'young') if this isnt there, then, get_top('ma', 'm') – pylearner Feb 10 '18 at 18:43
  • What is your expected output? – pylang Feb 11 '18 at 01:53
  • @pylang : input : get_top('ma', 'm', 'young', 'teacher') , using all the for args, it should work on all kinds of combinations and return outputs for the combinations which exist . – pylearner Feb 11 '18 at 04:40
  • For clarity, you should post exactly the output you expect. My post has an application for valid inputs. You can wrap around that last function to memoize the last result and revert if a `KeyError` is observed. – pylang Feb 11 '18 at 04:49

3 Answers3

0

This is one non-pandas solution. Counter.most_common() orders results by most common descending counts.

from collections import Counter

def get_top(maritalstatus=None, gender=None, age_range=None, occ=None):

    cols = ['maritalstatus', 'gender', 'age_range', 'occ']
    values = [maritalstatus, gender, age_range, occ]

    c = Counter(df.query(' & '.join((('({0} ==  "{1}")').format(i, j)) \
                for i, j in zip(cols, values) if j))['rating'])

    return c.most_common()

get_top(maritalstatus='ma', gender='M', age_range='young')  # [('PG', 2)]
jpp
  • 159,742
  • 34
  • 281
  • 339
  • what if I dont have that combination? suppose, it stopped at marital status, gender and age range ..it should only return me until there – pylearner Feb 10 '18 at 10:45
  • @pylearner, see update. `q` is an arbitrary list of tuples, which can be fed in from a function. – jpp Feb 10 '18 at 10:51
  • can use that multiple columns ? – pylearner Feb 10 '18 at 10:57
  • yea, so how do I list of columns to q ? say, a,b,c,d are my columns... in a sequence i can only use a,b,c as a,b,c,d doesnt combine. ..so q should only have a,b,c ...how do i get it ? – pylearner Feb 10 '18 at 11:36
  • see update, in the function just include the variables you are searching for – jpp Feb 10 '18 at 11:45
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/164879/discussion-between-pylearner-and-jp-data-analysis). – pylearner Feb 10 '18 at 11:53
  • to be honest, i'd rather you define your input & expected output in the question. be explicit and give an example. – jpp Feb 10 '18 at 11:59
  • If you see my input in the question above, I have given some inputs of a new user, but in the data i havent had a teacher who is young ... so my function should only combine 'ma','m' and 'young' – pylearner Feb 10 '18 at 12:07
  • @pylearner, yes. and i've given a solution where you can input between 1 & 4 parameters. – jpp Feb 10 '18 at 12:27
0

You can use *args for dynamic input, (ordering of values cannot be changed) with query for filtering:

def get_top(*args):     
    c = ['maritalstatus', 'gender', 'age_range', 'occ']
    m = (df.groupby(c)['rating'].apply(lambda x: x.value_counts().index[0])
           .reset_index())
    args = list(args)
    while True:
        d = dict(zip(c, args))
        #https://stackoverflow.com/a/48371587/2901002
        q = ' & '.join((('({} ==  "{}")').format(i, j)) for i, j in d.items())
        m1 = m.query(q)['rating']  
        if m1.empty and len(args) > 1:
            args.pop()
        else:
            return m1

print(get_top('ma', 'M', 'young','teacher'))
1    PG
Name: rating, dtype: object
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • d = dict(zip(cols, args)), I dont see cols anywhere, can you let me know where did you take it from ? – pylearner Feb 10 '18 at 19:41
  • oops, it is `c` – jezrael Feb 10 '18 at 19:42
  • Series([], Name: rating, dtype: object), thats my output. – pylearner Feb 10 '18 at 19:59
  • With sample data? – jezrael Feb 10 '18 at 20:01
  • Can you test `df = pd.DataFrame({'maritalstatus': ['ma', 'ma', 'sin', 'sin', 'ma', 'sin'], 'gender': ['M', 'F', 'M', 'M', 'M', 'F'], 'age_range': ['young', 'adult', 'young', 'adult', 'young', 'adult'], 'occ': ['student', 'teacher', 'student', 'teacher', 'student', 'teacher'], 'rating': ['PG', 'R', 'PG', 'R', 'PG', 'R']})` ? – jezrael Feb 10 '18 at 20:02
  • But if test `get_top('ma', 'M', 'young','teacher)` it return noting, because not such data in Dataframe. – jezrael Feb 10 '18 at 20:03
  • yea, so if this _get_top('ma', 'M', 'young','teacher)_ is not returning anything, its should perform this _get_top('ma', 'm', 'young')_ ... automatically – pylearner Feb 10 '18 at 20:09
0

pandas is definitely the goto library for handling detailed tabular data. For those seeking a non-pandas option, you can build your own mapping and reduction functions. I use these terms to mean the following:

  • mapping: reorganize data grouped by a desired query
  • reduction: an aggregation function, used to tally or condense many values to one

pandas analogous groupby/aggregation concepts.

Given

Cleaned data where multiple spaces have been replaced with a single delimiter, e.g. ",".

%%file "test.txt"
status,gender,age_range,occ,rating
ma,M,young,student,PG
ma,F,adult,teacher,R
sin,M,young,student,PG
sin,M,adult,teacher,R
ma,M,young,student,PG
sin,F,adult,teacher,R

Code

import csv
import collections as ct

Step 1: Read data

def read_file(fname):
    with open(fname, "r") as f:
        reader = csv.DictReader(f)
        for line in reader:
            yield line


iterable = [line for line in read_file("test.txt")]
iterable

Output

[OrderedDict([('status', 'ma'),
              ('gender', 'M'),
              ('age_range', 'young'),
              ('occ', 'student'),
              ('rating', 'PG')]),
 OrderedDict([('status', 'ma'),
              ('gender', 'F'),
              ('age_range', 'adult'),
              ...]
 ...
] 

Step 2: Remap data

def mapping(data, column):
    """Return a dict of regrouped data."""
    dd = ct.defaultdict(list)
    for d in data:
        key = d[column]
        value = {k: v for k, v in d.items() if k != column}
        dd[key].append(value)
    return dict(dd)


mapping(iterable, "gender")

Output

{'M': [
   {'age_range': 'young', 'occ': 'student', 'rating': 'PG', ...},
   ...]
 'F': [
   {'status': 'ma', 'age_range': 'adult', ...},
   ...]
} 

Step 3: Reduce data

def reduction(data):
    """Return a reduced mapping of Counters."""
    final = {}
    for key, val in data.items():
        agg = ct.defaultdict(ct.Counter)
        for d in val:
            for k, v in d.items():
                agg[k][v] += 1
        final[key] = dict(agg)
    return final

reduction(mapping(iterable, "gender"))

Output

{'F': {
   'age_range': Counter({'adult': 2}),
   'occ': Counter({'teacher': 2}),
   'rating': Counter({'R': 2}),
   'status': Counter({'ma': 1, 'sin': 1})},
 'M': {
   'age_range': Counter({'adult': 1, 'young': 3}),
   'occ': Counter({'student': 3, 'teacher': 1}),
   'rating': Counter({'PG': 3, 'R': 1}),
   'status': Counter({'ma': 2, 'sin': 2})}
 }

Demo

With these tools in place, you can build a data pipeline and to query the data, feeding results from one function into another:

# Find the top age range amoung males
pipeline = reduction(mapping(iterable, "gender"))
pipeline["M"]["age_range"].most_common(1)
# [('young', 3)]

# Find the top ratings among teachers
pipeline = reduction(mapping(iterable, "occ"))
pipeline["teacher"]["rating"].most_common()
# [('R', 3)]

# Find the number of married people
pipeline = reduction(mapping(iterable, "gender"))
sum(v["status"]["ma"] for k, v in pipeline.items())
# 3

Overall, you tailor your output based on how you define your reduction function.

Note, the code from this generalized process is more verbose than a former example despite its powerful application to many data columns. pandas succinctly encapsulates these concepts. Although the learning curve may initially be more steep, it can greatly expedite data analysis.


Details

  1. Read data - we parse each line of a cleaned file using csv.DictReader, which maintains the header names as keys of a dictionary. This structure facilitates easier column access by name.
  2. Remap data - we group data as a dictionary.
    • The keys are items in the selected/queried column, e.g. "M", "F".
    • The values are each a list of dictionaries. Each dictionary represents a row of all remaining columnar data (excluding the key).
  3. Reduce data - we aggregate the values of the remapped data by tabulating related entries for all listed dictionaries. Together the defaultdict and Counter combination build an excellent reducing data structure where new entries to the defaultdict initialize a Counter and repeated entries simply tally observations.

Application

Pipelines are optional. Here we will build a single function that processes serial requests:

def serial_reduction(iterable, val_queries):
    """Return a `Counter` that is reduced after serial queries."""
    q1, *qs = val_queries 
    val_to_key = {v:k for k, v in iterable[0].items()}
    values_list = mapping(iterable, val_to_key[q1])[q1]

    counter = ct.Counter()
    # Process queries for dicts in each row and build a counter
    for q in qs:    
        try:
            for row in values_list[:]:
                if val_to_key[q] not in row:
                    continue
                else:
                    reduced_vals = {v for v in row.values() if v not in qs}
            for val in reduced_vals:
                counter[val] += 1
        except KeyError:
            raise ValueError("'{}' not found. Try a new query.".format(q))
    return counter


c = serial_reduction(iterable, "ma M young".split())
c.most_common()
# [('student', 2), ('PG', 2)]
serial_reduction(iterable, "ma M young teacher".split())
# ValueError: 'teacher' not found. Try a new query.
pylang
  • 40,867
  • 14
  • 129
  • 121