1

I have two columns like shown below, and trying to return the highest count of the second column, but its just returning me the highest count on rating without considering the gender

DATA :

print (df)

   AGE GENDER rating
0   10      M     PG
1   10      M      R
2   10      M      R
3    4      F   PG13
4    4      F   PG13

CODE :

 s = (df.groupby(['AGE', 'GENDER'])['rating']
       .apply(lambda x: x.value_counts().head(2))
       .rename_axis(('a','b', 'c'))
       .reset_index(level=2)['c'])

OUTPUT :

print (s[F])
('PG')

print(s[M]

('PG', 'R')
pylearner
  • 1,358
  • 2
  • 10
  • 26

2 Answers2

2

Here is a standard library solution for this file:

%%file "test.txt"
gender  rating
M   PG
M   R
F   NR
M   R
F   PG13
F   PG13

Given

import collections as ct


def read_file(fname):
    with open(fname, "r") as f:
        header = next(f)
        for line in f:
            gender, rating = line.strip().split()
            yield gender, rating

Code

filename = "test.txt"

dd = ct.defaultdict(ct.Counter)
for k, v in sorted(read_file(filename), key=lambda x: x[0]):
    dd[k][v] += 1 

{k: v.most_common(1) for k, v in dd.items()}
# {'F': [('PG13', 2)], 'M': [('R', 2)]}

Details

Each line of the file is parse and added to a defaultdict. The keys are genders, but the values are Counter objects for each rating per gender. Counter.most_common() is called to retrieve the top occurrences.

Since the data is grouped by gender, you can explore more information. For example, unique ratings of each gender:

{k: set(v.elements()) for k, v in dd.items()}
# {'F': {'NR', 'PG13'}, 'M': {'PG', 'R'}}
pylang
  • 40,867
  • 14
  • 129
  • 121
  • hey what if I have an additional column say age_range which have values like, 'young', 'adult', so i want the top most from young and male in a combination. – pylearner Feb 10 '18 at 10:36
  • This code would have to be modified to handle extra columns. If you post a new question, I could address it. I will leave this answer as is. – pylang Feb 10 '18 at 18:04
  • https://stackoverflow.com/questions/48719674/how-to-use-different-combinations-of-group-by-while-trying-to-get-the-top-most-v ...can you look in here – pylearner Feb 10 '18 at 18:31
1

I think you need for counts with categories and ratings use groupby + value_counts + head:

df1 = (df.groupby('gender')['rating']
         .apply(lambda x: x.value_counts().head(1))
         .rename_axis(('gender','rating'))
         .reset_index(name='val'))
print (df1)
  gender rating  val
0      F   PG13    2
1      M      R    2

If want only top ratings seelct first value of index per group:

s = df.groupby('gender')['rating'].apply(lambda x: x.value_counts().index[0])
print (s)
gender
F    PG13
M       R
Name: rating, dtype: object

print (s['M'])
R
print (s['F'])
PG13

Or only top counts select first value of Series per group:

s = df.groupby('gender')['rating'].apply(lambda x: x.value_counts().iat[0])
print (s)
gender
F    2
M    2
Name: rating, dtype: int64

print (s['M'])
2
print (s['F'])
2

EDIT:

s = df.groupby('gender')['rating'].apply(lambda x: x.value_counts().index[0])

def gen_mpaa(gender):
    return s[gender]

print (gen_mpaa('M'))

print (gen_mpaa('F'))

EDIT:

Solution if genre id values are strings:

print (type(df.loc[0, 'genre id']))
<class 'str'>

df = df.set_index('gender')['genre id'].str.split(',', expand=True).stack()
print (df)
gender   
M       0    11
        1    22
        2    33
        0    22
        1    44
        2    55
        0    33
        1    44
        2    55
F       0    11
        1    22
        0    22
        1    55
        0    55
        1    44
dtype: object

d = df.groupby(level=0).apply(lambda x: x.value_counts().index[0]).to_dict()
print (d)
{'M': '55', 'F': '55'}

EDIT1:

print (df)
   AGE GENDER rating
0   10      M     PG
1   10      M      R
2   10      M      R
3    4      F   PG13
4    4      F   PG13

s = (df.groupby(['AGE', 'GENDER'])['rating']
       .apply(lambda x: x.value_counts().head(2))
       .rename_axis(('a','b', 'c'))
       .reset_index(level=2)['c'])
print (s)

a   b
4   F    PG13
10  M       R
    M      PG
Name: c, dtype: object
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • when I insert this " s = df.groupby('gender')['rating'].apply(lambda x: x.value_counts().index[0]) " and returning s... its throwing me an error. I just need to send gender as my input for the function and it should give me the most frequent rating directly – pylearner Feb 08 '18 at 09:31
  • its just giving me the rating as same for both the genders, like F- pg13 and M -PG13 – pylearner Feb 08 '18 at 09:33
  • Hmm, maybe same number of top, you can check it by `print(df.groupby('gender')['rating'].value_counts())` – jezrael Feb 08 '18 at 09:35
  • This is my input gen_mpaa('F') , the out put is GENDERCODE F PG13 M PG13 U PG13 – pylearner Feb 08 '18 at 09:35
  • ok, but it should just return the rating of 'F' when given 'F' as my input – pylearner Feb 08 '18 at 09:36
  • I am a bit confused, what is problem? Wrong output - wrong returned strings? Or cannot return only one string? – jezrael Feb 08 '18 at 09:38
  • when I input for only 'F' it should just return me the highest count rating of 'F' – pylearner Feb 08 '18 at 09:42
  • OK, what is expected output? – jezrael Feb 08 '18 at 09:42
  • input : gen_mpa('F'), output : PG13 – pylearner Feb 08 '18 at 09:43
  • here we are returning s[gender], what if I want to return multiple parameters ? I have age_range as another parameter for my function gen_mpaa, and I want to return that as well. How do I do it? – pylearner Feb 09 '18 at 07:33
  • Hmmm, I am now a bit confused. You unaccept my solution so the best is ask this `pylang`. Also the best is modify solution with input data and expected output. – jezrael Feb 09 '18 at 07:38
  • hey I got it, I just used s[gender][age_range] – pylearner Feb 09 '18 at 07:40
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/164809/discussion-between-pylearner-and-jezrael). – pylearner Feb 09 '18 at 07:59
  • hey @jezrael .. https://stackoverflow.com/questions/48719674/how-to-use-different-combinations-of-group-by-while-trying-to-get-the-top-most-v ....can you answer this ? – pylearner Feb 10 '18 at 18:29
  • I can check it. – jezrael Feb 10 '18 at 18:29
  • Thanks, will be waiting for your reply – pylearner Feb 10 '18 at 18:32
  • hey you mentioned this in the above code **s = df.groupby('gender')['rating'].apply(lambda x: x.value_counts().index[0])** what if I give a specific value to the gender say `s = df.groupby('MALE')['rating'].apply(lambda x: x.value_counts().index[0])` its throwing me an error ... what to do incase of that ? – pylearner Feb 12 '18 at 07:36
  • it just says **KeyError: 'MALE'** – pylearner Feb 12 '18 at 07:41
  • It means there is no column `MALE`. if need filter values in `gender` by `MALE` need `s = df.query("gender == 'MALE'").groupby('gender')['rating'].apply(lambda x: x.value_counts().index[0])` – jezrael Feb 12 '18 at 07:42
  • in that case if i have multiple groupby's, **s = df.groupby(['MALE','YOUNG])['rating'].apply(lambda x: x.value_counts().index[0])** – pylearner Feb 12 '18 at 07:46
  • then change `df.query("gender" == 'MALE')` to `df[df["gender"].isin(['MALE','YOUNG'])]` – jezrael Feb 12 '18 at 07:47
  • hey can you look at this, I have briefly mentioned it. https://stackoverflow.com/questions/48741867/how-to-use-columns-values-to-groupby – pylearner Feb 12 '18 at 08:04
  • Am using this code **s = df.groupby(['MALE', 'GENDER'])['rating'].apply(lambda x: x.value_counts().index[0])** to get the top most frequent as my index is 0. As I need 2nd top most also, I have set the index to 1. But in my data I dont have the second top most frequent It throwed me an error. how to handle it ? – pylearner Feb 13 '18 at 12:26
  • What do you need happen if no second top? Add `NaN`? Can you explain more? – jezrael Feb 13 '18 at 12:27
  • It should only return me the top most, but not error – pylearner Feb 13 '18 at 12:30
  • I dont understand, do you need instaed top2 if not exist return top1? – jezrael Feb 13 '18 at 12:31
  • I need outputs for both, but when there is only top 1 and not top2, it should return only top 1 – pylearner Feb 13 '18 at 12:32
  • You need `s = df.groupby('gender')['rating'].apply(lambda x: x.value_counts().head(2)).rename_axis(('a','b')).reset_index(level=1)['b']` – jezrael Feb 13 '18 at 12:37
  • am doing a group by using ** age and gender**, say I have combinations of ** male and young ** , and that combination has only one rating pg13, so it should only return me pg13, as there is no top 2 – pylearner Feb 13 '18 at 12:44
  • 1
    Jez, I guess this will solve my problem, instead of this **10 M R M PG** as my output, cant it return `R and PG` if top 1 and top2 are there, and `PG` if only top1 is there, --- ('R', 'PG') if two, ('PG') if only top1 – pylearner Feb 13 '18 at 13:06
  • And yes your edit is working ... just need the output as explained above – pylearner Feb 13 '18 at 13:08
  • Can you add expected output to question? Because bad formating of comments. Thanks. – jezrael Feb 13 '18 at 13:08
  • jez, you can checkout my output in the question. I have edited it accordingly – pylearner Feb 13 '18 at 13:22
  • I dont understand why for `F` is no second value. – jezrael Feb 13 '18 at 13:35
  • As my dataset has only one top most for that sequence . Is there a way to get that ? – pylearner Feb 13 '18 at 13:46
  • Yes, it is possible get top. Why I dont understand why is for `F` not output `('PG', 'NR')` – jezrael Feb 13 '18 at 13:47
  • The output I mentioned is for the dataset, you used in edit 1, and not my dataset – pylearner Feb 13 '18 at 13:49
  • I think output from your dataset. – jezrael Feb 13 '18 at 13:49
  • Ok, you can use your dataset and get me the ouput I expected to be – pylearner Feb 13 '18 at 13:50
  • No, I dont understand what you need. :( Why is not `('PG', 'NR')` for `F`. – jezrael Feb 13 '18 at 13:52
  • @pylearner - Can you create new question? Because now answers are different as question :( – jezrael Feb 13 '18 at 13:56
  • sure, in a min. – pylearner Feb 13 '18 at 13:58
  • https://stackoverflow.com/questions/48768632/printing-the-top-2-of-frequently-occurred-values-of-the-target-column – pylearner Feb 13 '18 at 14:03
  • can you check that – pylearner Feb 13 '18 at 14:03