0

I am trying to get unique source, medium, campaign using STRING_AGG function partitioning by fullVisitorId and ORDER BY visitNumber. However, I am getting a lot of repeated fields in Source, medium and campaign using STRING_AGG.

So I thought of using DISTINCT but it doesn't work with STRING_AGG.

Any advice appreciated.

Code:

SELECT
  CAST(concat(SUBSTR(date,0,4), '-', SUBSTR(date,5,2),'-',SUBSTR(date,7,2)) as date ) as date,
  fullVisitorId,
  trafficSource.keyword,
  trafficSource.adContent,

STRING_AGG(channelGrouping), ' > ') OVER (PARTITION BY fullVisitorId ORDER BY visitNumber) AS channelPath,
STRING_AGG(trafficSource.medium), ' > ') OVER (PARTITION BY fullVisitorId ORDER BY visitNumber ) AS medium_Path,
STRING_AGG(trafficSource.source), ' > ') OVER (PARTITION BY fullVisitorId ORDER BY visitNumber) AS source_Path,
STRING_AGG(trafficSource.campaign), ' > ') OVER (PARTITION BY fullVisitorId ORDER BY visitNumber) AS campaign_Path

FROM
  `tfa-big-query.74006564.ga_sessions_*`,UNNEST(hits) as hit
WHERE _table_suffix BETWEEN '20180801' and '201808015'

The results in the STRING_AGG column (channelgrouping, medium and source is the occurrence of repetitive fields which can be seen as follows:

Actual results:

Organic Search > Organic Search > Organic Search < Other < Other < Direct < Owned < email < email < (none) < (none)

Expected Output:

Organic Search < Other < Direct < Owned < email < (None)

Siddhant Naik
  • 51
  • 1
  • 10

2 Answers2

1

Below is for BigQuery Standard SQL

#standardSQL
CREATE TEMP FUNCTION DeDup(arr ANY TYPE) AS ((
  SELECT STRING_AGG(str, ' > ') FROM (
    SELECT CONCAT(ANY_VALUE(x), ' (', CAST(COUNT(1) AS STRING), ')') str FROM (
      SELECT x, COUNTIF(flag) OVER(ORDER BY OFFSET) grp FROM (
        SELECT *,x!=IFNULL(LAG(x) OVER(ORDER BY OFFSET), x) flag FROM UNNEST(arr) x WITH OFFSET
      ))
    GROUP BY grp
)));
SELECT 
  DATE,
  fullVisitorId,
  keyword,
  adContent,
  DeDup(channelPath) AS channelPath,
  DeDup(medium_Path) AS medium_Path,
  DeDup(source_Path) AS source_Path,
  DeDup(campaign_Path) AS campaign_Path
FROM (
  SELECT
    CAST(CONCAT(SUBSTR(DATE,0,4), '-', SUBSTR(DATE,5,2),'-',SUBSTR(DATE,7,2)) AS DATE ) AS DATE,
    fullVisitorId,
    trafficSource.keyword,
    trafficSource.adContent,
    ARRAY_AGG(channelGrouping) OVER(PARTITION BY fullVisitorId ORDER BY visitNumber) AS channelPath,
    ARRAY_AGG(trafficSource.medium) OVER(PARTITION BY fullVisitorId ORDER BY visitNumber ) AS medium_Path,
    ARRAY_AGG(trafficSource.source) OVER(PARTITION BY fullVisitorId ORDER BY visitNumber) AS source_Path,
    ARRAY_AGG(trafficSource.campaign) OVER(PARTITION BY fullVisitorId ORDER BY visitNumber) AS campaign_Path
  FROM `tfa-big-query.74006564.ga_sessions_*`,UNNEST(hits) as hit
  WHERE _table_suffix BETWEEN '20180801' and '201808015'
)

Above will return something like

Organic Search (3) > Other (2) > Direct (1) > Owned (1) > email (2) > (none) (2)

which is a little extension of your expected result Organic Search < Other < Direct < Owned < email < (None)

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
0

I thought STRING_AGG work with DISTINCT

try this sql from bigquery document

SELECT STRING_AGG(DISTINCT fruit, " & ") AS string_agg
FROM UNNEST(["apple", NULL, "pear", "banana", "pear"]) AS fruit;
c1mone
  • 208
  • 1
  • 7