3

I am working in BigQuery to understand how many users complete a specific page path (at any point in the session). Lets say the page path is Page 1 -> Page 2 -> Page 3. The pages must be followed in sequential order. I am able to use BQ to establish a page path - but this method only works to identify users who reached those pages at any point in the session. For example, Page 1 -> Page 456 -> Page 2.

Any ideas?

(SELECT [date]
, CASE WHEN pages like '/Page1' then fullVisitorId end as [users]
, CASE WHEN pages like '/Page1>>/Page2' then fullVisitorId end as [path_users_2]
, CASE WHEN pages like '/Page1>>Page2>>Page3' then fullVisitorId end as [path_users_3]
, [path_type]
, [path]
, [product]
, [device.deviceCategory]
FROM

  ( SELECT [date]
    , [fullVisitorId]
    , [visitId]
    , [visitNumber]
    , group_concat(hits.page.pagePath,'>>') as [pages]
    , 'New Pages' as [path_type]
    , 'Upgrade' as [path]
    , 'Professional' as [product]
  FROM
      (
      TABLE_DATE_RANGE
          ( [XXXXXX.ga_sessions_]
          , TIMESTAMP('2014-06-01')
          , TIMESTAMP('2014-06-05') )
      )
  where
  (REGEXP_MATCH(hits.page.pagePath,r'^/Page1($|/$|\?|/\?|%3F)'))
  or (REGEXP_MATCH(hits.page.pagePath,r'^/Page2($|/$|\?|/\?|%3F)'))
  or ( (REGEXP_MATCH(hits.page.pagePath,r'^/Page3($|/$|\?|/\?|%3F)'))
  and hits.transaction.transactionId is not null
  and hits.item.productSku is not null
  and hits.item.itemRevenue is not null )
  group each by [date]
  , [fullVisitorId]
  , [visitId]
  , [visitNumber]
  , [path_type]
  , [path]
  , [product]
  , [device.deviceCategory]
  )
group each by
[date]
, [path_type]
, [path]
, [product]
, [users]
, [path_users_2]
, [path_users_3]
, [device.deviceCategory]

)

sharks552
  • 73
  • 2
  • 8

2 Answers2

4

/For your particular use case, I'm pretty sure you can do this with much faster execution time by avoiding both JOIN and GROUP BY.

Consider:

SELECT
  [date], fullVisitorId, visitId, visitNumber,
  GROUP_CONCAT(REGEXP_EXTRACT(hits.page.pagePath, '^(/[^/?]*)'), ">>")
    WITHIN RECORD AS Sequence,
FROM
  (TABLE_DATE_RANGE
      ( [XXXXXX.ga_sessions_]
      , TIMESTAMP('2014-06-01')
      , TIMESTAMP('2014-06-05') )
  )
WHERE REGEXP_MATCH(hits.page.pagePath, r'^/Page[123]')
HAVING
  Sequence CONTAINS "/Page1>>/Page2>>/Page3";

This takes advantage of scoped aggregation at the RECORD level to avoid having to GROUP BY individual sessions.

Furthermore, individual records are atomic within Bigquery, and their repeated fields are processed in the order they were supplied at import time. For GA session logs, therefore, the hits subrecords are concatted in sequential order as everything was done WITHIN RECORD. Flattening the hit timestamps and then joining them with a comparison really just redoes this work.

Matthew Wesley
  • 616
  • 3
  • 5
2

You need to construct a sequence of queries, and reach step by step to your full path, using hits.time as time sequence. Taking example from Streak blog post: Using Google BigQuery for Event Tracking

We can create a subquery to determine the visitHomepage events:

(SELECT sessionId as sessionId1,
        timestamp as timestamp1
 FROM [events.log]
 WHERE name = "visitHomepage") AS step1

then similar for step2, step3.

then you can combine these to obtain steps1_2

(SELECT sessionId1,
        timestamp1,
        IF(timestamp1 < timestamp2, timestamp2, NULL) as timestamp2
 FROM
      (SELECT sessionId1,
              timestamp1,
              timestamp2
       FROM step1
       LEFT JOIN step2
       ON sessionId1 = sessionId2)
) AS steps1_2

Obtaining the subquery we desire!

(SELECT sessionId1 as sessionId,
        timestamp1 as visitHomepageTimestamp,
        timestamp2 as installExtensionTimestamp,
        IF(timestamp2 < timestamp3, timestamp3, NULL) as signInTimestamp
 FROM
      (SELECT sessionId2,
              timestamp2,
              timestamp3
       FROM steps1_2
       LEFT JOIN step3
       ON sessionId1 = sessionId3)
) AS steps1_2_3

Read the above linked blog post to have a granular step by step explanation how to construct the query, and also check out BigQuery Cookbook.

Alternatively you can order your query based on hits.time to define order of pages visited by the user and use ROW_NUMBER or POSITION to add them the sequence number, than you can use that result set further.

Pentium10
  • 204,586
  • 122
  • 423
  • 502
  • Thanks! This is really helpful. My only follow-up question is in BQ using hits.time would work to structure the sequence properly. However, even if time of load page 1 < time of load page 2 I could have visited a page in-between, thus breaking the 'exact' sequence of pages 1 - 2 - 3. So it seems like there isn't really a way to ensure that an exact page sequence is followed (like there is in GA via advanced segments). – sharks552 Mar 04 '15 at 22:17
  • @sharks552 check out my other answer, mostly about using POSITION function, you can combine somehow that technique to get the exact position of the hits inside the repeated field. If this is not a repeated field that you can use ROW_COUNT to prepare a subquery which returns correct sequence number (1,2,3) ordered by hits.time. Then you can work on the results of that query. http://stackoverflow.com/questions/28557636/how-to-find-last-item-in-a-repeated-structure-in-bigquery/28558099#28558099 – Pentium10 Mar 05 '15 at 06:10
  • I keep running into an error: Error: Cannot query the cross product of repeated fields a.hits.hitNumber and b.hits.hitNumber. I am trying to pull out pages where the hitNumber from table B is +1 the hitNumber from table A (if table A is page 1, table B is page 2) but it seems that BQ is not letting me do that. I tried using position(a.hits.hitNumber) etc. but that did not work either. – sharks552 Mar 06 '15 at 16:26
  • @sharks552 that's means you need to FLATTEN your source table, you can search for the syntax, but there is a limitation with table queries. – Pentium10 Mar 06 '15 at 19:38