I have a pandas dataframe which looks like this:
{
'entry': string,
'category': string,
'cat_1_score': float,
'cat_2_score': float,
'cat_3_score': float
}
A piece of this dataframe:
**entry** **category**
Entry161 cat_1
Entry37 cat_2; cat_1
Entry209 cat_1
Entry161 cat_1
Entry161 cat_1
Entry46 cat_1
Entry47 cat_3; cat_1
Entry37 cat_2; cat_1
Entry75 cat_1
Entry161 cat_1
Reproducible:
df = pd.DataFrame([['Entry161', 'cat_1'], ['Entry137', 'cat_2; cat_1'], ['Entry209', 'cat_1'], ['Entry161', 'cat_1'], ['Entry161', 'cat_1'], ['Entry46', 'cat_1'], ['Entry47', 'cat_3; cat_1'], ['Entry37', 'cat_2; cat_1'], ['Entry75', 'cat_1'], ['Entry161', 'cat_1']], columns=['entry', 'category'])
and I want to create a Python Counter
Object that will be a nested dictionary of the form {'entry': {'category': count}}
. Values of entry
are non-unique, and values of category
can be one or more of cat_1, cat_2, cat_3
. For example, in the first row, since there is only cat_1
, the value for cat_1_score
will be greater than zero, and the values for cat_2_score
and cat_3_score
will be zero.
In the example above, the Python Counter should look like:
{'Entry161': {'cat_1': 4}, 'Entry37': {'cat_1':2, 'cat_2': 2}...}
In the case that category
only takes single values, the straightforward, albeit inefficient way of doing this would be iterating over the rows of the DataFrame like this:
my_counter = defaultdict(Counter)
for i, row in df.iterrows():
my_counter[row.entry][row.category] += 1
Depending on the value of category
(one of cat_1
, cat_2
, cat_3
) the appropriate _score
is nonzero. So there has to be faster way to generate these counters without iterating over rows of the dataframe, and using vectorization instead.
Unfortunately this gets more complicated if there is the option of having multiple entries in the category
column, separated by a semi colon, for example one row can have cat_1; cat_3
as an entry to this column, and the counter needs to take into account both cases. In that case, my way to work around it would be:
for i, row in df.iterrows():
for cat in ['cat_1', 'cat_2', 'cat_3']:
if cat in row.category:
my_counter[row.entry][cat] += 1
Is there a way to substitute these two code snippets with a more pandas-preferred way?