
I have a MultiIndex series that looks like this:

user_id  cookie  browser
1        1_1     [chrome45]
2        2_1     [IE 7]
2        2_2     [IE 7, IE 8]

There are two levels to this MultiIndex, user_id and cookie. The value is the browser.

What I want to do is to count the number of times a user uses a different browser.

So user 1 in this case only used one browser, but user 2 used three browsers (IE 7 appeared twice under different cookies, so I count it twice instead of once).

How can I loop through it and get a result like this:

r = defaultdict(int)

for user_id in multiIndex_series:
    for cookie in multiIndex_series[user_id]:
        r[user_id] += len(multiIndex_series[user_id][cookie]) # I don't know how to get user_id out of the MultiIndex series
Cheng

1 Answer


You can use groupby with apply and a lambda function that returns the length of the flattened lists (see the linked answer for more info):

df = pd.DataFrame({'user_id':[1,2,2],
                   'cookie':['1_1','2_1','2_2'],
                   'browser':[['chrome45'],['IE 7'],['IE 7','IE 8']]})
df = df.set_index(['user_id','cookie'])
print (df)
                     browser
user_id cookie              
1       1_1       [chrome45]
2       2_1           [IE 7]
        2_2     [IE 7, IE 8]

from itertools import chain
print (df.groupby(level='user_id')['browser']
         .apply(lambda x: len(list(chain.from_iterable(x)))))
user_id
1    1
2    3
Name: browser, dtype: int64

Instead of a lambda, you can use a custom function f, which is easier to test:

def f(x):
    print (list(chain.from_iterable(x)))
    return len(list(chain.from_iterable(x)))

print (df.groupby(level='user_id')['browser'].apply(f))
['chrome45']
['IE 7', 'IE 7', 'IE 8']
user_id
1    1
2    3
Name: browser, dtype: int64
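
As a possible alternative to flattening with chain, the per-row list lengths can be summed per user. This is only a sketch: it assumes every value in browser is a list, it rebuilds the sample frame from above so that it runs on its own, and the name counts is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'user_id': [1, 2, 2],
                   'cookie': ['1_1', '2_1', '2_2'],
                   'browser': [['chrome45'], ['IE 7'], ['IE 7', 'IE 8']]})
df = df.set_index(['user_id', 'cookie'])

# .str.len() gives the length of each list; summing per user_id
# counts a browser seen under two cookies twice, as the question asks.
counts = df['browser'].str.len().groupby(level='user_id').sum()
print(counts)
```

This avoids calling a Python function once per group, which may matter on larger data.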

If you need to loop over the series, one possible solution is:

# items() replaces iteritems(), which was removed in pandas 2.0
for idx, val in df['browser'].items():
    print (idx)  # idx is a (user_id, cookie) tuple
    print (val)

(1, '1_1')
['chrome45']
(2, '2_1')
['IE 7']
(2, '2_2')
['IE 7', 'IE 8']
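
Building on that loop, the dictionary from the question could be filled roughly as follows. This is a minimal sketch, assuming the data is a Series with a two-level (user_id, cookie) index; the name s stands in for multiIndex_series and is constructed here only so the example is self-contained.

```python
from collections import defaultdict

import pandas as pd

# Stand-in for multiIndex_series from the question (an assumption).
s = pd.Series([['chrome45'], ['IE 7'], ['IE 7', 'IE 8']],
              index=pd.MultiIndex.from_tuples(
                  [(1, '1_1'), (2, '2_1'), (2, '2_2')],
                  names=['user_id', 'cookie']),
              name='browser')

r = defaultdict(int)
# Each key of a MultiIndex Series is a tuple, so it can be unpacked
# into user_id and cookie directly in the for statement.
for (user_id, cookie), browsers in s.items():
    r[user_id] += len(browsers)

print(dict(r))
```

The groupby version above is still likely to be the better choice for large data, since it avoids an explicit Python loop over rows.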
jezrael
  • Thanks so much for the help! The part that I don't quite get is the `x` in `chain.from_iterable(x)`. What is `x` in this case? I tried printing out `x` and I can see it is the entire group. How does `chain.from_iterable` pick out the `browser` column specifically? (because of the `groupby(...)['browser']` part? ) – Cheng Dec 15 '16 at 08:58
  • If you use `df.groupby(level='user_id')['browser']` and then apply some function, then in each call the `x` variable holds `df['browser']` for one group. The best way to check is to test it in the `f` function with `def f(x): print (x)` – jezrael Dec 15 '16 at 09:01
  • I almost did this `for user_id in multiIndex.series.index.level[0] for cookie in multiIndex.series[userid]`. Thanks for saving me :) – Cheng Dec 15 '16 at 09:01
  • Yes, but loops in pandas are obviously slow, so it is better to avoid them. See this perfect [answer](http://stackoverflow.com/a/24871316/2901002); `Jeff` is now a developer of pandas. – jezrael Dec 15 '16 at 09:03
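
To make the point about `x` from the comments concrete, the groups can be inspected directly by iterating over the groupby object. This is an illustrative sketch only: the frame is rebuilt from the answer's sample data, and the name seen is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'user_id': [1, 2, 2],
                   'cookie': ['1_1', '2_1', '2_2'],
                   'browser': [['chrome45'], ['IE 7'], ['IE 7', 'IE 8']]})
df = df.set_index(['user_id', 'cookie'])

seen = {}
# Selecting ['browser'] after groupby means each group x is a Series
# holding only the browser values for one user_id.
for user_id, x in df.groupby(level='user_id')['browser']:
    seen[user_id] = x.tolist()
    print(user_id, type(x).__name__, x.tolist())
```

Each `x` is therefore a Series of lists, which is why `chain.from_iterable(x)` can flatten it without naming the column again.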