Here's my pivot table:

No  Keyword              Count
1   Sell Laptop Online   10
2   Buy Computer Online  8
3   Laptop and Case      5

Here's what I want:

No   Word      Count
1    Online    18
2    Laptop    15
3    Sell      10
4    Buy        8
5    Computer   8
6    and        5
7    Case       5 

What I did is:

df['Keyword'].str.split(expand=True).stack().value_counts()

But the result is:

No   Word      Count
1    Online    2
2    Laptop    2
3    Sell      1
4    Buy       1
5    Computer  1
6    and       1
7    Case      1 

How do I get a weighted word count from the pivot table?

Nabih Bawazir

2 Answers


Use:

df1 = (df.set_index('Count')['Keyword']
         .str.split(expand=True)
         .stack()
         .reset_index(name='Word')
         .groupby('Word')['Count']
         .sum()
         .sort_values(ascending=False)
         .reset_index())

Explanation:

  1. Set Count as the index with set_index so that information isn't lost
  2. Create a DataFrame of words with str.split and expand=True
  3. Reshape with stack
  4. Convert the MultiIndex to columns with reset_index (the intermediate result is sketched after this list)
  5. Aggregate with sum
  6. Sort the Series with Series.sort_values
  7. Finish with reset_index
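
A minimal sketch of the intermediate steps, assuming the sample df from the question:

import pandas as pd

df = pd.DataFrame({'Keyword': ['Sell Laptop Online',
                               'Buy Computer Online',
                               'Laptop and Case'],
                   'Count': [10, 8, 5]})

# steps 1-4: one row per word, with its Count carried along from the index
words = (df.set_index('Count')['Keyword']
           .str.split(expand=True)
           .stack()
           .reset_index(name='Word'))
print (words[['Word', 'Count']])
#        Word  Count
# 0      Sell     10
# 1    Laptop     10
# 2    Online     10
# 3       Buy      8
# 4  Computer      8
# 5    Online      8
# 6    Laptop      5
# 7       and      5
# 8      Case      5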

Another solution, which is faster for larger DataFrames:

from itertools import chain

s = df['Keyword'].str.split()

df = pd.DataFrame({
    'Word' : list(chain.from_iterable(s.values.tolist())), 
    'Count' : df['Count'].repeat(s.str.len())
})

print (df)
       Word  Count
0      Sell     10
0    Laptop     10
0    Online     10
1       Buy      8
1  Computer      8
1    Online      8
2    Laptop      5
2       and      5
2      Case      5

df1 = df.groupby('Word')['Count'].sum().sort_values(ascending=False).reset_index()
print (df1)
       Word  Count
0    Online     18
1    Laptop     15
2      Sell     10
3  Computer      8
4       Buy      8
5       and      5
6      Case      5

Explanation:

  1. First, repeat each Count value once per word of the split Keyword to build a new DataFrame (the repeat mechanics are sketched after this list)
  2. Aggregate with sum, sort the Series, and finish with reset_index
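
A small sketch of the repeat mechanics, assuming the sample df from the question:

import pandas as pd

df = pd.DataFrame({'Keyword': ['Sell Laptop Online',
                               'Buy Computer Online',
                               'Laptop and Case'],
                   'Count': [10, 8, 5]})

s = df['Keyword'].str.split()
print (s.str.len().tolist())
# [3, 3, 3]  <- number of words per row

# each Count value is repeated once per word in its row
print (df['Count'].repeat(s.str.len()).tolist())
# [10, 10, 10, 8, 8, 8, 5, 5, 5]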

Solution with defaultdict:

from collections import defaultdict

out = defaultdict(int)
for k, c in zip(df['Keyword'], df['Count']):
    for x in k.split():
        out[x] += c

print (out)
defaultdict(<class 'int'>, {'Sell': 10,
                            'Laptop': 15, 
                            'Online': 18, 
                            'Buy': 8, 
                            'Computer': 8,
                            'and': 5,
                            'Case': 5})

#sorting by values and DataFrame constructor
#https://stackoverflow.com/a/613218
df = pd.DataFrame(sorted(out.items(), key=lambda kv: kv[1], reverse=True),
                  columns=['Word','Count'])
print (df)

       Word  Count
0    Online     18
1    Laptop     15
2      Sell     10
3       Buy      8
4  Computer      8
5       and      5
6      Case      5

Performance depends on the real data, but the solution with defaultdict seems to be the fastest:

import string
from itertools import chain
from collections import defaultdict

import numpy as np
import pandas as pd

np.random.seed(456)


a = np.random.randint(0, 20, 10000)
b = [' '.join(np.random.choice(list(string.ascii_letters), 
                               np.random.randint(3, 5))) for _ in range(len(a))]

df = pd.DataFrame({"Keyword":b, "Count":a})
#print (df)

In [49]: %%timeit
    ...: df1 = (df.set_index('Count')['Keyword']
    ...:          .str.split(expand=True)
    ...:          .stack()
    ...:          .reset_index(name='Word')
    ...:          .groupby('Word')['Count']
    ...:          .sum()
    ...:          .sort_values(ascending=False)
    ...:          .reset_index())
    ...: 
35.5 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [52]: %%timeit
    ...: from itertools import chain
    ...: 
    ...: s = df['Keyword'].str.split()
    ...: 
    ...: pd.DataFrame({
    ...:     'Word' : list(chain.from_iterable(s.values.tolist())), 
    ...:     'Count' : df['Count'].repeat(s.str.len())
    ...: }).groupby('Word')['Count'].sum().sort_values(ascending=False).reset_index()
    ...: 
14.5 ms ± 194 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [53]: %%timeit
    ...: from collections import defaultdict
    ...: 
    ...: out = defaultdict(int)
    ...: for k, c in zip(df['Keyword'], df['Count']):
    ...:     for x in k.split():
    ...:         out[x] += c
    ...: pd.DataFrame(sorted(out.items(), key=lambda kv: kv[1], reverse=True), columns=['Word','Count'])
    ...: 
8.82 ms ± 25.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#Dark solution
In [54]: %%timeit
    ...: df['Keyword'].str.get_dummies(sep=' ').mul(df['Count'],0).sum(0).to_frame('Count')
    ...: 
307 ms ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
jezrael

Here's a simple method using one-hot encoding.

df['Keyword'].str.get_dummies(sep=' ').mul(df['Count'],axis=0).sum(0).to_frame('Count')

          Count
Buy           8
Case          5
Computer      8
Laptop       15
Online       18
Sell         10
and           5

In case you need a boost in speed, try MultiLabelBinarizer from scikit-learn, i.e.

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

vec = MultiLabelBinarizer()

# one-hot encode the split keywords, weight each row by its Count, sum per word
oh = (vec.fit_transform(df['Keyword'].str.split()) * df['Count'].values[:, None]).sum(0)
pd.DataFrame({'Count': oh, 'Word': vec.classes_})
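
For the sample data this should give the same totals as the get_dummies result, as a sketch (vec.classes_ comes back sorted, with uppercase letters before lowercase):

   Count      Word
0      8       Buy
1      5      Case
2      8  Computer
3     15    Laptop
4     18    Online
5     10      Sell
6      5       and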

Explanation:

get_dummies produces a one-hot encoded DataFrame:

   Buy  Case  Computer  Laptop  Online  Sell  and
0    0     0         0       1       1     1    0
1    1     0         1       0       1     0    0
2    0     1         0       1       0     0    1

Multiplying each row by its Count gives:

   Buy  Case  Computer  Laptop  Online  Sell  and
0    0     0         0      10      10    10    0
1    8     0         8       0       8     0    0
2    0     5         0       5       0     0    5

Summing the columns and converting to a DataFrame gives:

Buy          8
Case         5
Computer     8
Laptop      15
Online      18
Sell        10
and          5
dtype: int64
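
To match the desired layout exactly, one extra chain sorts the totals and moves the words into a column; a minimal sketch building on the get_dummies one-liner above:

res = (df['Keyword'].str.get_dummies(sep=' ')
         .mul(df['Count'], axis=0)
         .sum(0)
         .sort_values(ascending=False)
         .rename_axis('Word')
         .reset_index(name='Count'))
print (res)
#        Word  Count
# 0    Online     18
# 1    Laptop     15
# 2      Sell     10
# ... (ties such as Buy/Computer at 8 may come back in either order)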
Bharath M Shetty