1

We have two tables in BigQuery like below:

Table A

 Name | Question  | Answer
 -----+-----------+-------
 Bob  | Interest  | a
 Bob  | Interest  | b
 Sue  | Interest  | a
 Sue  | Interest  | c
 Joe  | Interest  | a
 Joe  | Interest  | b
 Joe  | Interest  | c
 Joe  | Interest  | d

Table B (Static)

           Interests                        |   Segment
--------------------------------------------+------------------
["a"]                                       |   S1
["a","b"]                                   |   S2 
["a", "b", "c", "d"]                        |   S3

Expected table

 User | Question  | Answer
 -----+-----------+-------
 Bob  | Interest  | a
 Bob  | Interest  | b
 Sue  | Interest  | a
 Sue  | Interest  | c
 Joe  | Interest  | a
 Joe  | Interest  | b
 Joe  | Interest  | c
 Joe  | Interest  | d
          (+)
 Bob  | Segment   | S1
 Bob  | Segment   | S2
 Sue  | Segment   | S1
 Joe  | Segment   | S1
 Joe  | Segment   | S2
 Joe  | Segment   | S3 

In the above tables, the Answer field is of string type and Interests is of array type.

Pointers:

  1. One user can have one or more interests.
  2. One or more interests can belong to one segment.
  3. A user will be assigned to a segment only when all of his interests are matched.

Any inputs/thoughts in the right direction would be greatly appreciated.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • please take a look at https://stackoverflow.com/questions/3653462/is-storing-a-delimited-list-in-a-database-column-really-that-bad – nbk Nov 02 '20 at 10:13

2 Answers2

1

Below is for BigQuery Standard SQL

#standardSQL
select name, question, answer from `project.dataset.tableA`
union all
select name, 'Segment', segment
from (
  select 
    name, 'Segment', segment,
    ( select countif(y is null)
      from b.interest x
      left join a.answers y
      on x = y
    ) = 0 qualified
  from (
    select name, array_agg(answer) answers
    from `project.dataset.tableA`
    group by name
  ) a, `project.dataset.tableB` b
)
where qualified    

if to apply to sample data from your question - output is

enter image description here

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
  • That worked like a charm! However, I didn't quite understand the logic here. A short description would be of great help. –  Nov 03 '20 at 03:11
  • sure. will add comments tomorrow :o) – Mikhail Berlyant Nov 03 '20 at 05:37
  • brief explanation: 1) most inner select - simply generates array of answers for each user; 2) then result is cross joined with segments (tableB) and `qualified` attribute is calculated (see details below in #4); 3) finally output is filtered to only those which qualified and union with tableA; 4) logic for `qualified` is to check rule #3 in your question by counting how many user's interests are above what is in segments - if it is zero - it is qualified. Hope this helps to understand better the solution in my answer – Mikhail Berlyant Nov 03 '20 at 17:57
0

This looks like a union all -- where the second query unnests the interests and joins:

select a.name, a.question, a.answer
from a
union all
select a.name, 'segment', min(b.segment)
from a join
     (b cross join
      unnest(interests) b_interest
     )
     on a.interest = b_interest
group by name
having min(b.segment) = max(b.segment);

You might need a left join, if some interests don't have segments.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Would it also fulfill point #3? Only if all the values are in the array, it should get matched. –  Nov 02 '20 at 13:36
  • Getting this error `SELECT list expression references b.segment which is neither grouped nor aggregated at [11:29]` while executing the above query –  Nov 02 '20 at 13:51
  • Any thoughts? Looks like the segment field has to be an aggregatable value. How that can be achieved? –  Nov 02 '20 at 14:14
  • @Gopal . . . I fixed the answer a while ago. – Gordon Linoff Nov 02 '20 at 14:15
  • Tried this.. but not getting any results. Just the initial select statement is giving the results. –  Nov 02 '20 at 14:18