
I have a Pandas DataFrame with 2 columns: one for the category (PROBLEM_TYPE) and one for the description of that category. Obviously, the value count is different for each category, but since I'm planning on training a model based on this DataFrame, I'm trying to set the same value count for each category. That way my model would be more accurate.

DataFrame:

filtered_df.head()

    PROBLEM_TYPE    DESCRIPTION
0   problem_type1   blabla...
1   problem_type1   blablablabla...
2   problem_type3   bla...
3   problem_type7   blablabloblo...
4   problem_type2   blobloble...

This is what I get when I call the value_counts() function:

filtered_df["PROBLEM_TYPE"].value_counts()

problem_type1            141887
problem_type2             21663
problem_type3             19349
problem_type4             15710
problem_type5              5229
problem_type6              5161
problem_type7              4682
problem_type8              3672
problem_type9              3296
problem_type10             3131
Name: PROBLEM_TYPE, dtype: int64

As you can see, there are 10 different problem types (categories), and each of them has a different value count. My question is how to set the same value count for all the problem types, for example so that each problem type ends up with the minimal value count in the DataFrame (which is 3131).

In other words, how do I set the frequency of each category to be the same, so that the next time I run the value_counts() function it would look something like this:

filtered_df["PROBLEM_TYPE"].value_counts()

problem_type1              3131
problem_type2              3131
problem_type3              3131
problem_type4              3131
problem_type5              3131
problem_type6              3131
problem_type7              3131
problem_type8              3131
problem_type9              3131
problem_type10             3131
Name: PROBLEM_TYPE, dtype: int64

Thanks in advance.

Makintoy

1 Answer


You can do the following:

mi = df["PROBLEM_TYPE"].value_counts().min()
df_filtered = df.sample(frac=1).groupby("PROBLEM_TYPE").head(mi)
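
To double-check, you can call value_counts() on the result; with the counts shown in the question, every problem type should come out at 3131:

df_filtered["PROBLEM_TYPE"].value_counts()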

There are two steps in this solution. First, find the minimum count among the problem types:

mi = df["PROBLEM_TYPE"].value_counts().min()

Then shuffle your DataFrame rows and take the first mi rows of each group:

df_filtered = df.sample(frac=1).groupby("PROBLEM_TYPE").head(mi)

See here for a discussion on how to shuffle a dataframe.
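
If it helps, here is a minimal self-contained sketch of the same approach. The small DataFrame is made up for illustration; only the column names PROBLEM_TYPE and DESCRIPTION are taken from the question:

import pandas as pd
import numpy as np

# Made-up toy data standing in for filtered_df.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "PROBLEM_TYPE": rng.choice(
        ["problem_type1", "problem_type2", "problem_type3"],
        size=1_000,
        p=[0.7, 0.2, 0.1],
    ),
    "DESCRIPTION": ["bla"] * 1_000,
})

# Step 1: the smallest per-category count.
mi = df["PROBLEM_TYPE"].value_counts().min()

# Step 2: shuffle the rows, then keep the first `mi` rows of every category.
df_filtered = df.sample(frac=1, random_state=0).groupby("PROBLEM_TYPE").head(mi)

# Every category now appears exactly `mi` times.
print(df_filtered["PROBLEM_TYPE"].value_counts())

If your pandas version is recent enough (1.1 or later), df.groupby("PROBLEM_TYPE").sample(n=mi) should give an equivalent result without the manual shuffle.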

Dani Mesejo