0

I have a table in BigQuery which gets new data daily and appends it to the current table. This table is called : score_FULL. Additionally, I keep the individual scores on a daily basis, which can be accessed by score_20180125 if we choose today's date. Daily scores are appended to score_FULL on a daily basis.

score_FULL contains:

visitorID         score
#Older scores first
1                 0.15
2                 0.78
3                 0.12
6                 0.90
------------------------
2                 0.22
6                 0.65
7                 0.61
10                0.24
------------------------
1                 0.31
2                 0.41
10                0.12
-------------------------
#Newest scores appended

I would like to see user score changes. Each time a user gets a new score, we append it horizontally. Every-time we get a new user, we append it vertically. Assuming a user can only get 1 score per day, the ideal solution is:

visitorID     score1      score2       score3
1             0.15        0.31
2             0.78        0.22         0.41
3             0.12 

E.g. a table that grows horizontally (new scores) AND vertically (new users)

I can do something similar using a sequence of Left Joins on individual tables, but this would only give me the visitor data from 1st table we run left joins from.

Note: I can add another column Date which will simply repeat the same date for all values, if it makes things easier.

GRS
  • 2,807
  • 4
  • 34
  • 72
  • do you have in the full table the date for each row? – Pentium10 Jan 25 '18 at 10:04
  • No I don't but I do have the date from `score_DATE` table, which gets appended to the `FULL` table. Although, if there is an easy solution, I can add the date to the table too. So it will append the same date for all rows appended that day – GRS Jan 25 '18 at 10:08
  • Have you ever worked with Window Functions or table wildcard queries? What's the maximum day span your data goes back? – Pentium10 Jan 25 '18 at 12:40
  • @Pentium10 No, I haven't. Only goes back 6 days at the moment, but this will grow daily. I could probably make it so that only last `n` days are reported. – GRS Jan 25 '18 at 14:18
  • do you use table partitioning? If not why not? – Pentium10 Jan 25 '18 at 15:15
  • @Pentium10 What do you mean by partitioning? Each day has it's own table `score_DATE` and `score_FULL` is just a UNION ALL of all rows – GRS Jan 30 '18 at 15:45
  • Instead of having a day and full table you can have on single Partitioned table. https://cloud.google.com/bigquery/docs/partitioned-tables – Pentium10 Jan 30 '18 at 18:29
  • I believe the term you're looking for is `pivot`. However this function doesn't seem to be supported in *bigquery* yet. But after a quick google it seems like there are ways around it. Just [look at this example](https://stackoverflow.com/questions/40761453/transpose-rows-into-columns-in-bigquery-pivot-implementation) – Oceans Jan 31 '18 at 13:53
  • @GRS - Could you maybe add the current query to your question, the one you used to union the different tables into a single view? Because I don't really get the sample data you provided because I see *VisitorID 1* has 2 different values for day 1: *0.15* and *0.44* and in your desired result you simply ignore the second value. Is that just a typo? Just add the query for *score_FULL* – Oceans Feb 02 '18 at 08:43
  • @Oceans My scripts appends daily data from `score_date` to the `score_FULL` so the query is just to SELECT * FROM `score_date` with the destination table set on append. The reason there are 2 scores for 1 user is to highlight that a single user can make multiple visits during the same day. Any value is okay to save a generally they are similar. – GRS Feb 02 '18 at 10:00
  • @Oceans The more I think about it, the less likely I think I will arrive at the conclusion that I want with the current structure. – GRS Feb 02 '18 at 10:09
  • @Oceans I think its easier to just keep all scores, so for user 1, the sequence is: `[0.15,0.44,0.31]`. This way I just need to create some sort of pivot e.g. for each visitor Id, list the scores horizontally. – GRS Feb 02 '18 at 12:00
  • @GRS - Can you edit your question accordingly? Because you're making it really confusing. Just say what you actual want. Do you want a the columns to represent a day or just scores? Do you want it to be dynamic?(keep in mind you'll probably have to limit the amount of columns at one point. You'll have to decide how you want to limit this, last X amount of scores/days) Just start over, say what data you have available and the desired result, don't think about limitations just describe what you want first. – Oceans Feb 02 '18 at 12:25
  • @Oceans Sorry, I hope I made it clearer now – GRS Feb 02 '18 at 12:34

1 Answers1

2

Instead of adding columns dynamically (which is quite a challenge here) - I would recommend aggregating respective visitor's scores in one column as either array or string of individual scores

Below is for BigQuery Standard SQL

#standardSQL
WITH `project.dataset.score_FULL` AS ( 
  SELECT 1 visitorID, 0.15 score UNION ALL
  SELECT 2, 0.78 UNION ALL
  SELECT 3, 0.12 UNION ALL
  SELECT 6, 0.90 UNION ALL
  SELECT 2, 0.22 UNION ALL
  SELECT 6, 0.65 UNION ALL
  SELECT 7, 0.61 UNION ALL
  SELECT 10, 0.24 UNION ALL
  SELECT 1, 0.31 UNION ALL
  SELECT 2, 0.41 UNION ALL
  SELECT 10, 0.12 
)
SELECT 
  visitorID,
  ARRAY_AGG(score) scores_as_array,
  STRING_AGG(CAST(score AS STRING)) scores_as_list
FROM `project.dataset.score_FULL`
GROUP BY visitorID   

with output as

Row visitorID   scores_as_array scores_as_list   
1   1           0.15            0.15,0.31    
                0.31         
2   2           0.78            0.78,0.22,0.41   
                0.22         
                0.41         
3   3           0.12            0.12     
4   6           0.9             0.9,0.65     
                0.65         
5   7           0.61            0.61     
6   10          0.24            0.24,0.12    
                0.12         
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230