0

I have a pandas dataframe which looks like this:

{
   'entry': string,
   'category': string,
   'cat_1_score': float,
   'cat_2_score': float,
   'cat_3_score': float
}

A piece of this dataframe:

**entry**             **category**
Entry161              cat_1
Entry37               cat_2; cat_1
Entry209              cat_1
Entry161              cat_1
Entry161              cat_1
Entry46               cat_1
Entry47               cat_3; cat_1
Entry37               cat_2; cat_1
Entry75               cat_1
Entry161              cat_1

Reproducible:

df = pd.DataFrame([['Entry161', 'cat_1'], ['Entry137', 'cat_2; cat_1'], ['Entry209', 'cat_1'], ['Entry161', 'cat_1'], ['Entry161', 'cat_1'], ['Entry46', 'cat_1'], ['Entry47', 'cat_3; cat_1'], ['Entry37', 'cat_2; cat_1'], ['Entry75', 'cat_1'], ['Entry161', 'cat_1']], columns=['entry', 'category'])

and I want to create a Python Counter Object that will be a nested dictionary of the form {'entry': {'category': count}}. Values of entry are non-unique, and values of category can be one or more of cat_1, cat_2, cat_3. For example, in the first row, since there is only cat_1, the value for cat_1_score will be greater than zero, and the values for cat_2_score and cat_3_score will be zero.

In the example above, the Python Counter should look like:

{'Entry161': {'cat_1': 4}, 'Entry37': {'cat_1':2, 'cat_2': 2}...}

In the case that category only takes single values, the straightforward, albeit inefficient way of doing this would be iterating over the rows of the DataFrame like this:

my_counter = defaultdict(Counter)
for i, row in df.iterrows():
    my_counter[row.entry][row.category] += 1

Depending on the value of category (one of cat_1, cat_2, cat_3) the appropriate _score is nonzero. So there has to be faster way to generate these counters without iterating over rows of the dataframe, and using vectorization instead.

Unfortunately this gets more complicated if there is the option of having multiple entries in the category column, separated by a semi colon, for example one row can have cat_1; cat_3 as an entry to this column, and the counter needs to take into account both cases. In that case, my way to work around it would be:

for i, row in df.iterrows():
    for cat in ['cat_1', 'cat_2', 'cat_3']:
        if cat in row.category:
            my_counter[row.entry][cat] += 1

Is there a way to substitute these two code snippets with a more pandas-preferred way?

nvergos
  • 432
  • 3
  • 15
  • Please provide *an actual example input*. That will give you a lot more traction on this question. Check out [this question](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for some good tips on providing reproducible pandas examples... – juanpa.arrivillaga Dec 14 '17 at 17:56
  • Thank you for your comment, I just used `qtconsole` to get a slice" of my dataframe and added some data with an example. – nvergos Dec 14 '17 at 17:58
  • 1
    What? *no* that isn't helpful at all. Again, please read the link. There are many ways to provide reproducible data. That isn't one of them... – juanpa.arrivillaga Dec 14 '17 at 18:00
  • I just added a snippet that generates this toy dataframe, providing reproducible data. – nvergos Dec 14 '17 at 18:08

0 Answers0