
I have website visitor data that resembles the example below:

id pages
001 /ice-cream, /bagels, /bagels/flavors
002 /pizza, /pizza/flavors, /pizza/recipe

I would like to transform it to the format below, where I can count the number of times each visitor has viewed a part of my site that deals with specific content. A total count of all pageviews (the comma-delimited entries) would be great as well.

id bagel_count
001 2
002 0
id pizza_count
001 0
002 3
id total_pages_count
001 3
002 3

I have the option to do this in SQL or Python, but I am not sure which is easier, hence the question.

I have referenced the following questions, but they do not serve my purpose:

hansolo
  • Please tag your database. Also, is the list of pages/words you're trying to get a count on, known and finite? – Radagast Nov 29 '21 at 15:15
  • 1
    @PhilCoulson apologies - I am using Snowflake - I know the list of words I am trying to get a count on - i.e ice cream, pizza, bagels - BENY gave me a good solution to get all words below but a targeted version would be great too – hansolo Nov 29 '21 at 15:23

4 Answers


We can split, then explode, and get your result with crosstab:

df['pages'] = df.pages.str.split(r'[/, ]')
s = df.explode('pages')
out = pd.crosstab(s['id'], s['pages']).drop('', axis=1)
out
Out[427]: 
pages  bagels  flavors  ice-cream  pizza  recipe
id                                              
1           2        1          1      0       0
2           0        1          0      3       1
BENY
  • This is a really cool, automated way to approach this and I appreciate the quick answer - The issue I have is that some of my links are verbose, such as `pizza/blog/pepperoni/learn-how-to-cook-best-pizza-with-doc-brown` in this case I feel like it would be a bad idea to have counts of `learn` `how` `to` `cook` ... any way I can key in on specific words I am trying to count or would you recommend taking this approach and eliminating "stop word" columns in the df after? – hansolo Nov 29 '21 at 15:05
  • 1
    @hansolo count it then remove the unwanted column could be better ~ – BENY Nov 29 '21 at 15:17
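Following the suggestion in the comment thread (count everything, then drop the unwanted columns), here is a minimal sketch of a targeted version. It assumes the same comma-delimited `pages` column as in the question; the `targets` list is the hypothetical set of words you care about:

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["001", "002"],
    "pages": ["/ice-cream, /bagels, /bagels/flavors",
              "/pizza, /pizza/flavors, /pizza/recipe"],
})

# Total pageviews: one comma-delimited entry per page visited
df["total_pages_count"] = df["pages"].str.split(",").str.len()

# Split on '/', ',' and spaces, explode, and cross-tabulate
s = df.assign(pages=df["pages"].str.split(r"[/, ]")).explode("pages")
counts = pd.crosstab(s["id"], s["pages"])

# Keep only the target words, dropping stop-word columns
targets = ["bagels", "pizza"]
out = counts[targets].add_suffix("_count").reset_index()
out = out.merge(df[["id", "total_pages_count"]], on="id")
```

This keeps the crosstab approach from the answer but restricts the output to the requested `*_count` columns, matching the shape the question asks for.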

I personally like to use regular expressions with groups, then explode into a DataFrame which I merge back into the main one. This has several advantages over the split method, mainly avoiding excessive memory usage, which results in a dramatic performance improvement.

import re
from typing import List, Dict
import pandas as pd

my_words = [
    'bagels',
    'flavors',
    'ice-cream', 
    'pizza',
    'recipe'
]

def count_words(string: str, words: List[str] = my_words) -> Dict[str, int]:
    """
    Returns a dictionary of summated values
    for selected words contained in string
    """
    
    # Create a dictionary to return values
    match_dict = {x:0 for x in words}
    
    # Numbered capture groups with word boundaries
    # Note this will not allow pluralities, unless specified
    # Also: cache (or frontload) this value to improve performance
    my_regex_string = '|'.join((fr'\b({x})\b' for x in words))
    my_pattern = re.compile(my_regex_string)
    
    for match in my_pattern.finditer(string):
        value = match.group()
        match_dict[value] +=1
    
    return match_dict


# Create a new df with values from function
new_df = df['pages'].apply(count_words).apply(pd.Series)

    bagels  flavors ice-cream   pizza   recipe
0   2   1   1   0   0
1   0   1   0   3   1


# Merge back to the main df
df[['id']].merge(new_df, left_index=True, right_index=True)

id  bagels  flavors ice-cream   pizza   recipe
0   1   2   1   1   0   0
1   2   0   1   0   3   1
Yaakov Bressler
  • Very nice solution. Would this be possible if instead of a list of words I would have a dict of words, and I wanted to count occurrences of each group (key) of words. E.g. my_words = { 'food':['bagels','pizza','ice-cream'], 'flavors':['chocolate','strawberry','ice-cream'], 'others':['recipe'] } – MrT77 Dec 23 '22 at 18:15
  • It certainly is possible, but you'd have to do some fancy merge statements. The simplest approach would be to loop through your dict and apply this approach – but instead of `.apply(pd.Series)` you should use `.sum()` @MrT77 – Yaakov Bressler Dec 24 '22 at 14:35
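Expanding on the comment exchange above, here is a minimal sketch of the grouped variant: summing matches per group of words rather than per word. The `word_groups` dict is hypothetical (taken from the comment), and the counting follows the same word-boundary regex idea as the answer:

```python
import re
import pandas as pd

df = pd.DataFrame({
    "id": ["001", "002"],
    "pages": ["/ice-cream, /bagels, /bagels/flavors",
              "/pizza, /pizza/flavors, /pizza/recipe"],
})

# Hypothetical grouping, as in the comment above
word_groups = {
    "food": ["bagels", "pizza", "ice-cream"],
    "others": ["recipe"],
}

def count_group(string: str, words) -> int:
    """Total occurrences of any word in `words` within `string`."""
    pattern = re.compile("|".join(fr"\b{re.escape(x)}\b" for x in words))
    return len(pattern.findall(string))

# One summed count column per group (key)
for group, words in word_groups.items():
    df[f"{group}_count"] = df["pages"].apply(lambda s: count_group(s, words))
```

This collapses each group's matches into a single column, which is the `.sum()`-per-group behavior the comment describes, without the merge gymnastics.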

I would go this route if you prefer SQL. I typically leave pivoting to reporting applications, but if you really insist, Snowflake has good documentation on it for you to take it from here.

with cte (id, pages) as

(select '001', '/ice-cream, /bagels, /bagels/flavors' union all
 select '002', '/pizza, /pizza/flavors, /pizza/recipe')
  
  
select id, 
       t2.value, 
       count(*) as word_count,
       length(pages)-length(replace(pages,',',''))+1 as user_page_count
from cte, lateral split_to_table(translate(cte.pages, '- ,','/'),'/') as t2 -- normalize word delimiters using translate (similar to replace)
where t2.value in ('bagels','pizza') --your list goes here
group by id, pages, t2.value;
Radagast

Going to mark @BENY's answer correct because of its elegance, but I found a way to do this in Python, focusing on a specific keyword. Assume df looks like my original table:

df['bagel_count'] = df["pages"].str.count('bagel')
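Extending this one-liner to every keyword plus a total pageview count (a sketch, assuming the same df as the question) reproduces the exact output format asked for above:

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["001", "002"],
    "pages": ["/ice-cream, /bagels, /bagels/flavors",
              "/pizza, /pizza/flavors, /pizza/recipe"],
})

# One targeted count column per keyword of interest
for word in ["bagel", "pizza"]:
    df[f"{word}_count"] = df["pages"].str.count(word)

# Comma-delimited entries => total pageviews
df["total_pages_count"] = df["pages"].str.count(",") + 1
```

Note that `str.count` treats its pattern as a regex and matches substrings, so "bagel" also counts "bagels"; add word boundaries (e.g. `r"\bbagels?\b"`) if that matters.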

hansolo