I have a pandas DataFrame with 2 columns: one for the category (PROBLEM_TYPE) and one for the description of that category. Obviously, the value count differs between categories, but since I'm planning on training a model based on this DataFrame, I'm trying to set the same value count for each category. That way my model would be more accurate.
DataFrame:
filtered_df.head()
PROBLEM_TYPE DESCRIPTION
0 problem_type1 blabla...
1 problem_type1 blablablabla...
2 problem_type3 bla...
3 problem_type7 blablabloblo...
4 problem_type2 blobloble...
This is what I get when I call the value_counts() function:
filtered_df["PROBLEM_TYPE"].value_counts()
problem_type1 141887
problem_type2 21663
problem_type3 19349
problem_type4 15710
problem_type5 5229
problem_type6 5161
problem_type7 4682
problem_type8 3672
problem_type9 3296
problem_type10 3131
Name: PROBLEM_TYPE, dtype: int64
As you can see, there are 10 different problem types (categories) and each of them has a different value count. My question is: how do I set the same value count for all the problem types? For example, so that each problem type has the minimal value count in the DataFrame (which is 3131).
In other words, how do I set the frequency of each category to be the same, so that the next time I run the value_counts() function it would look something like this:
filtered_df["PROBLEM_TYPE"].value_counts()
problem_type1 3131
problem_type2 3131
problem_type3 3131
problem_type4 3131
problem_type5 3131
problem_type6 3131
problem_type7 3131
problem_type8 3131
problem_type9 3131
problem_type10 3131
Name: PROBLEM_TYPE, dtype: int64
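The closest I've come up with is downsampling every category to the size of the smallest one via groupby plus sample, but I'm not sure it's the right approach. A minimal sketch, assuming pandas >= 1.1 (for DataFrameGroupBy.sample) and a small made-up frame standing in for my real filtered_df:

```python
import pandas as pd

# Small stand-in for filtered_df (the real one has ~220k rows)
filtered_df = pd.DataFrame({
    "PROBLEM_TYPE": ["problem_type1"] * 7 + ["problem_type2"] * 5 + ["problem_type3"] * 3,
    "DESCRIPTION": [f"bla{i}" for i in range(15)],
})

# Size of the smallest category (3131 in my real data)
min_count = filtered_df["PROBLEM_TYPE"].value_counts().min()

# Randomly keep min_count rows per category; random_state for reproducibility
balanced_df = filtered_df.groupby("PROBLEM_TYPE").sample(n=min_count, random_state=42)

print(balanced_df["PROBLEM_TYPE"].value_counts())
```

In this toy frame every category ends up with 3 rows. Is this a reasonable way to do it, or is there a more idiomatic one?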
Thanks in advance.