
I have the following dataframe consisting of a UserId and the Name of a badge earned by that user on Stack Overflow. Each badge belongs to a particular category: Question, Answer, Participation, Moderation, or Tag. I want to create a column called Category to store the category of each badge.

The code that I have written works well when the data has fewer than 1M users, but with more data it just keeps running and never finishes. How can I fix this?

Dataframe (badges)

UserId | Name
  1    | Altruist
  2    | Autobiographer
  3    | Enlightened
  4    | Citizen Patrol
  5    | python

Code

def category(df):  
  
  questionCategory = ['Altruist', 'Benefactor', 'Curious', 'Inquisitive', 'Socratic', 'Favorite Question', 'Stellar Question', 'Investor', 'Nice Question', 'Good Question', 'Great Question', 'Popular Question', 'Notable Question', 'Famous Question', 'Promoter', 'Scholar', 'Student']
  
  answerCategory = ['Enlightened', 'Explainer', 'Refiner', 'Illuminator', 'Generalist', 'Guru', 'Lifejacket', 'Lifeboat', 'Nice Answer', 'Good Answer', 'Great Answer', 'Populist', 'Revival', 'Necromancer', 'Self-Learner','Teacher', 'Tenacious', 'Unsung Hero']
  
  participationCategory = ['Autobiographer','Caucus', 'Constituent', 'Commentator', 'Pundit', 'Enthusiast', 'Fanatic', 'Mortarboard', 'Epic', 'Legendary', 'Precognitive', 'Beta', 'Quorum', 'Convention', 'Talkative', 'Outspoken', 'Yearling']
  
  moderationCategory = ['Citizen Patrol', 'Deputy', 'Marshal', 'Civic Duty', 'Cleanup', 'Constable', 'Sheriff', 'Critic', 'Custodian', 'Reviewer', 'Steward', 'Disciplined', 'Editor', 'Strunk & White', 'Copy Editor', 'Electorate', 'Excavator', 'Archaeologist', 'Organizer', 'Peer Pressure', 'Proofreader', 'Sportsmanship', 'Suffrage', 'Supporter', 'Synonymizer', 'Tag Editor', 'Research Assistant', 'Taxonomist', 'Vox Populi']

  #Tag Category will be represented as 0
  df['Category'] = 0

  for i in range(len(df)) : 
    if (df.loc[i, "Name"] in questionCategory):
      df.loc[i, 'Category'] = 1 

    elif (df.loc[i, "Name"] in answerCategory):
      df.loc[i, 'Category'] = 2 

    elif (df.loc[i, "Name"] in participationCategory):
      df.loc[i, 'Category'] = 3 

    elif (df.loc[i, "Name"] in moderationCategory):
      df.loc[i, 'Category'] = 4 

  return df   

category(stackoverflow_badges)

Expected Output

UserId | Name           | Category
  1    | Altruist       |    1
  2    | Autobiographer |    3
  3    | Enlightened    |    2
  4    | Citizen Patrol |    4
  5    | python         |    0
Ishan Dutta

1 Answer

If you want to update a dataframe with more than 1M rows, then you definitely want to avoid for loops whenever possible. There is an easier way to update your 'Category' column, as was done here.

In your case, you just need to convert your four lists of badge names into a dictionary that maps each badge name to its numerical category, like:

category_dict = {
    **{key: 1 for key in questionCategory},
    **{key: 2 for key in answerCategory},
    **{key: 3 for key in participationCategory},
    **{key: 4 for key in moderationCategory},
}

You can then replace the whole for loop with this command:

df['Category'] = df['Name'].map(category_dict).fillna(0)

This may not solve your whole issue, but it will at least save some time.
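
As a minimal, self-contained sketch of the approach above, the snippet below rebuilds the five-row sample from the question and applies the dictionary lookup. The category lists are shortened to three badges each for brevity (the full lists from the question would go in their place), and the dataframe name `badges` is only illustrative.

```python
import pandas as pd

# Shortened badge lists (assumption: substitute the full lists from the question).
questionCategory = ['Altruist', 'Benefactor', 'Curious']
answerCategory = ['Enlightened', 'Explainer', 'Refiner']
participationCategory = ['Autobiographer', 'Caucus', 'Constituent']
moderationCategory = ['Citizen Patrol', 'Deputy', 'Marshal']

# Badge name -> numeric category.
category_dict = {
    **{key: 1 for key in questionCategory},
    **{key: 2 for key in answerCategory},
    **{key: 3 for key in participationCategory},
    **{key: 4 for key in moderationCategory},
}

# Sample rows taken from the question.
badges = pd.DataFrame({
    'UserId': [1, 2, 3, 4, 5],
    'Name': ['Altruist', 'Autobiographer', 'Enlightened', 'Citizen Patrol', 'python'],
})

# One vectorized lookup replaces the row-by-row loop. Names missing from the
# dictionary (tag badges such as 'python') become NaN and are filled with 0.
badges['Category'] = badges['Name'].map(category_dict).fillna(0)
print(badges)
```

The loop in the question does one `.loc` read and up to four list scans per row, whereas `map` performs a single hash lookup per row inside pandas, which is why it tends to scale much better.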

Ralubrusto
  • This works correctly but has one issue. There is another category called `Tag`, which has to be set to `0` if the badge does not belong to any of the other 4 categories. Can you add it to the code as well? – Ishan Dutta Nov 12 '20 at 03:42
  • Sorry, I've read it in your question but I forgot to put it in the code. Just add `.fillna(0)`, since not matching values will be set to `nan` by default. – Ralubrusto Nov 12 '20 at 04:06
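
To illustrate the point made in the comments: `map` leaves `NaN` for names that have no entry in the dictionary, which also turns the column into floating point. The sketch below uses a hypothetical two-entry dictionary to show the intermediate result, then fills the gaps and casts back to integers so the values match the expected output in the question. The `.astype(int)` step is an addition, not part of the original answer.

```python
import pandas as pd

# Hypothetical, deliberately tiny mapping for illustration only.
category_dict = {'Altruist': 1, 'Enlightened': 2}
names = pd.Series(['Altruist', 'python', 'Enlightened'])

mapped = names.map(category_dict)      # 'python' has no entry, so it maps to NaN
filled = mapped.fillna(0).astype(int)  # NaN -> 0 (Tag), then float -> int

print(mapped.isna().tolist())
print(filled.tolist())
```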