
I have a dataset like this:

import pandas as pd

d = pd.DataFrame({
    'users_list': [["us1", "us2", "us3", "us5", "us5"], ['us2', "us3", 'us2']],
    'users_tuples': [[('us1', 'us2'), ('us2', 'us3'), ('us5', 'us1'), ('us5', 'us1')],
                     [('us2', 'us3'), ('us3', 'us2')]]})

First I get a list of all users without repetition, like this:

all_users = sorted(list(set(sum([x for x in d['users_list']],[]))))

And then I do this:

for us in all_users:
    d[us] = d.apply(lambda x: [1 if (a, us) in x['users_tuples'] else 0 for a in x['users_list']], axis=1)

But the answer I got is a list:

us1              us2              us3              us5
[0, 0, 0, 1, 1]  [1, 0, 0, 0, 0]  [0, 1, 0, 0, 0]  [0, 0, 0, 0, 0]
[0, 0, 0]        [0, 1, 0]        [1, 0, 1]        [0, 0, 0]

And I want the sum of each one of these, so it will be:

us1 us2 us3 us5
2   1   1   0
0   1   2   0

I know that to get this I can do:

for us in all_users:
    d[us] = d.apply(lambda x: sum([1 if (a, us) in x['users_tuples'] else 0 for a in x['users_list']]), axis=1)

But I don't think all of these transformations are efficient, and I was wondering whether there is a more efficient way to do them.

halfer
Catarina Nogueira

1 Answer


You can try using collections.Counter to count the occurrences, then convert that to a dataframe, align the columns with df.reindex, and fill missing values with 0 using df.fillna:

from collections import Counter
from itertools import chain
from operator import itemgetter

def f(x):
    # count the second element of each tuple, e.g. ('us5', 'us1') -> 'us1'
    l = map(itemgetter(1), x)  # equivalent to `(v for _, v in x)` or `map(lambda v: v[1], x)`
    return Counter(l)

(pd.DataFrame(d['users_tuples'].map(f).tolist())
   .reindex(set(chain.from_iterable(d['users_list'])), axis=1)
   .fillna(0))

   us2  us5  us3  us1
0    1  0.0    1  2.0
1    1  0.0    1  0.0
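As an aside (my addition, not from the original answer): `set` iteration order is arbitrary, which is why the columns above come out unordered; swapping in `sorted` gives a stable order. A self-contained sketch of the same approach:

```python
from collections import Counter
from itertools import chain
from operator import itemgetter

import pandas as pd

d = pd.DataFrame({
    'users_list': [["us1", "us2", "us3", "us5", "us5"], ['us2', "us3", 'us2']],
    'users_tuples': [[('us1', 'us2'), ('us2', 'us3'), ('us5', 'us1'), ('us5', 'us1')],
                     [('us2', 'us3'), ('us3', 'us2')]]})

# Count the second element of each tuple per row.
counts = d['users_tuples'].map(lambda x: Counter(map(itemgetter(1), x)))

# sorted() instead of set() gives a deterministic column order.
cols = sorted(set(chain.from_iterable(d['users_list'])))
out = pd.DataFrame(counts.tolist()).reindex(cols, axis=1).fillna(0).astype(int)
print(out)
```

Note that counting tuples ignores duplicates in `users_list`: in the second row this gives `us3 = 1`, whereas the question's expected output has `us3 = 2` (because `'us2'` appears twice in that row's `users_list`).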
Ch3steR
  • Hi, thanks for the reply! Can you explain why this is more efficient? – Catarina Nogueira Jul 20 '20 at 09:28
  • 1
    using a `for-loop` on a dataframe is significantly slower than a vectorized solution. And don't use `sum([x for x in d['users_list']],[])` to flatten a list; it's very inefficient, with quadratic runtime. Use `chain`/`chain.from_iterable` instead. And you can test it yourself with `timeit` ;) [`read about it here`](https://stackoverflow.com/a/952946/12416453) @CatarinaNogueira – Ch3steR Jul 20 '20 at 09:32
  • 1
    And `df.apply` over axis 1 is also inefficient and should be used as a last resort. Read more about it [here, posted by cs95](https://stackoverflow.com/q/54432583/12416453); and note that you are calling `df.apply` on every iteration of the loop @CatarinaNogueira – Ch3steR Jul 20 '20 at 09:37
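The flattening point from the comment above can be sketched directly (the variable names are just for illustration):

```python
from itertools import chain

nested = [["us1", "us2", "us3", "us5", "us5"], ["us2", "us3", "us2"]]

# sum(nested, []) re-copies the growing result list at every step: O(n^2).
flat_sum = sum(nested, [])

# chain.from_iterable walks each element exactly once: O(n).
flat_chain = list(chain.from_iterable(nested))

assert flat_sum == flat_chain
```

Both produce the same flat list; the difference only shows up as runtime on large inputs, which `timeit` will confirm.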