0

In Google Analytics BigQuery, the following query tries to aggregate information both at the sessions and at the hits level

SELECT
  visitId,
  trafficSource.source as source,
  trafficSource.medium as medium,
  device.browser as browser,
  min(hits.hour) as firstHitHour,
  boolean(count(hits.eventInfo.eventCategory = "SomeValue")) as hasSomeValue
FROM
  [my-table.ga_sessions_20150216]
GROUP BY
  visitId, source, medium, browser;

The line boolean(count(hits.eventInfo.eventCategory = "SomeValue")) as hasSomeValue is intended to be true if, among all the hits of a row's session, at least one hit has the hits.eventInfo.eventCategory equal to SomeValue.

The expected results look like:

sessionId   source         medium      browser      firstHitHour     hasSomeValue
--------------------------------------------------------------------------------------
12318       google         cpc         firefox      12               true
13317       google         organic     safari       14               null
13551       bing           organic     firefox      14               true
13610       orange         display     chrome       14               true
14381       stackoverflow  referral    safari       15               false
14422       google         organic     chrome       15               true

However, the line mentioned above doesn't seem to work. Even if I put some dumb value, for instance hits.eventInfo.eventCategory = "Blablablabla", it will output true for some rows anyway (whereas obviously no hit has this dumb value).

Actually, the instruction count(hits.eventInfo.eventCategory = "Blablablabla") - without the boolean() returns seemingly random results (have nothing in common with the actual counts).

What should this line be to trigger the correct output?

Jivan
  • 21,522
  • 15
  • 80
  • 131

4 Answers4

1

You'll have an easier time using standard SQL. To check for presence of the value within hits, use an EXISTS clause over it. For example,

SELECT
  visitId,
  trafficSource.source as source,
  trafficSource.medium as medium,
  device.browser as browser,
  (SELECT MIN(hour) FROM UNNEST(hits)) as firstHitHour,
  EXISTS (SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventCategory = "SomeValue") as hasSomeValue
FROM
  `my-table.ga_sessions_20150216`
GROUP BY
  visitId, source, medium, browser;

See also the guide on migrating from legacy to standard SQL.

If you do want to stick with legacy SQL, you will need to use WITHIN RECORD in conjunction with the count over hits.

Elliott Brossard
  • 32,095
  • 2
  • 67
  • 99
  • Thanks - I didn't realize I had to uncheck the Legacy SQL box - now I have the `UNNEST expression references column hits which is neither grouped nor aggregated at` error from `UNNEST(hits)` – Jivan Nov 24 '16 at 16:56
  • See related question - http://stackoverflow.com/questions/40792981/unnest-expression-references-column-which-is-neither-grouped-nor-aggregated – Jivan Nov 24 '16 at 19:58
1

First, the problem of using boolean(count(hits.eventInfo.eventCategory = "SomeValue")) is that it will return true if hits.eventInfo.eventCategory has any non-NULL value (either "SomeValue" or another different) and it only will return false when all values of hits.eventInfo.eventCategory are null. But this is not an unexpected result:

  1. Using hits.eventInfo.eventCategory = "SomeValue", BQ assigns true or false according to the condition as expected to each hit with a non-NULL value and null if that hit has the null value.
  2. However, since count "Returns the total number of non-NULL values in the scope of the function", using count(hits.eventInfo.eventCategory = "SomeValue") (grouping by visitId) BQ returns the aggregated number of true AND false obtained before, and only returns 0 if all hits were null for a given visitId. In other words, it counts all the non-null values, not only those satisfying the condition, so the problem of your query is here.
  3. boolean(...) works as expected: returning true if the last expression is not 0 and not NULL, false if it is 0 and NULL if it is NULL (you won't obtain any null).

Second, as you said in your last comment, you can use some() instead of boolean(count()), but note that using it you will not obtain any null result, as you initially wanted in the provided example. You will obtain true if there is at least one hits.eventInfo.eventCategory equal to "SomeValue" and false otherwise. If this is ok for you, I think that this is the best option.

However, if you want to obtain null when all hits of the session have null hits.eventInfo.eventCategory, you could use:

SELECT
      visitId,
      trafficSource.source as source,
      trafficSource.medium as medium,
      device.browser as browser,
      min(hits.hour) as firstHitHour,
      boolean(MAX(              
               CASE WHEN hits.eventInfo.eventCategory = "SomeValue" THEN 1
                    WHEN hits.eventInfo.eventCategory != "SomeValue" THEN 0
                    ELSE null END
             )
         ) as hasSomeValue
FROM
  [my-table.ga_sessions_20150216]
GROUP BY
   visitId, source, medium, browser

First, to each hit we assign 1 if the condition is satisfied, 0 if it is not and null if the value is null. Second, by asking for the max(), we'll get 1 if there is at least one "SomeValue", 0 if there are non-null values but all of them are different from "SomeValue", and null if all hits have null-values for hits.eventInfo.eventCategory. Finally, we use boolean() to get true, false or null, depending on each case.

Pol Ferrando
  • 663
  • 4
  • 11
0

I found a way of outputting the desired result, by using some() instead of boolean(count())

SELECT
  visitId,
  trafficSource.source as source,
  trafficSource.medium as medium,
  device.browser as browser,
  min(hits.hour) as firstHitHour,
  some(hits.eventInfo.eventCategory = "SomeValue") as hasSomeValue
FROM
  [my-table.ga_sessions_20150216]
GROUP BY
  visitId, source, medium, browser;

I'm not really sure why boolean(count()) triggered unexpected results, however.

Jivan
  • 21,522
  • 15
  • 80
  • 131
0

Try below

#standardSQL
SELECT
  visitId, source, medium, browser,
  MIN(hour) AS firstHitHour,
  LOGICAL_OR(hasValue) AS hasValue
FROM (
  SELECT
    visitId,
    trafficSource.source AS source,
    trafficSource.medium AS medium,
    device.browser AS browser,
    h.hour AS hour,
    EXISTS(SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventCategory = "SomeValue") AS hasValue
  FROM
   `my-table.ga_sessions_20161122`, UNNEST(hits) AS h
)
GROUP BY
visitId, source, medium, browser;
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230