
I have a date(createDt)-partitioned BigQuery table that is loaded from Kafka streaming data. I want to select only the latest record per ID, based on updatedTime. I was able to achieve this with the second top-voted solution referred here. I wanted it as a view, and created it with the following code:

CREATE OR REPLACE VIEW myView AS
SELECT row.* FROM (
  SELECT ARRAY_AGG(t ORDER BY updatedTime DESC LIMIT 1)[OFFSET(0)] AS row
  FROM `yourTable` AS t
  GROUP BY ID
)

I noticed that if I query myView with the partitioning date column createDt in the WHERE condition, the entire base table is scanned, which I don't want.
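For context, the filtered query against the view looks like this (the date literal is just an example):

    SELECT *
    FROM myView
    WHERE createDt = '2022-09-16'  -- still bills a scan of every partition of `yourTable`
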

However, if I create a normal view like

  create or replace view myNormalView as 
    SELECT * FROM  `yourTable`

and then query it filtering on createDt, only the required partition is scanned and far fewer bytes are read/billed.

My question: how do I make partition pruning work on top of ARRAY_AGG(t ORDER BY updatedTime DESC LIMIT 1)[OFFSET(0)]?

I am still a beginner in BigQuery. Any suggestion/solution is highly appreciated.

  • Unfortunately, as your view retrieves the last state of your data (without duplicates), it processes the whole table and doesn't use the partitioning. You can maybe think about another system to remove duplicates. – Mazlum Tosun Sep 16 '22 at 19:45
  • This is a question for a processing SQL warehouse rather than BigQuery. But can you limit the days you need to look back? Then include a `where date(createDt)>current_date()-3`. If you also need the history, run your query on a daily schedule into a helper table and query that table in your view as well. This will only help if there are many entries per ID. Please give some more information about the size and how often the view will run. – Samuel Sep 17 '22 at 09:02
  • @Samuel This table is loaded with streaming data. On average we may get 500–1000 rows ingested. The volume is not high. The structure from Kafka is parsed as JSON by Spark and loaded to BQ. The BQ table has a nested structure. This data needs to be consumed for reporting (Tableau), where they will primarily filter on createDt but need only the latest record per ID. So I thought I would create a view to handle this and give the view to the reporting team. The reporting has to be near real time. – Harikrishnan Balachandran Sep 17 '22 at 14:23
  • Is it affordable for you to run the query of your current view once a day? How large is the table where you save the queried data of your view? Would it be OK for each optimized view to be billed for that table plus today's data? Are all columns needed for the reporting, or can some be removed to reduce the billed amount? – Samuel Sep 17 '22 at 17:38
  • @Samuel - I solved the issue. Check the answers below. Thanks for your time! – Harikrishnan Balachandran Sep 18 '22 at 11:50
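Samuel's look-back suggestion from the comments could be sketched like this (the 3-day window and the assumption that createDt is a DATE column are illustrative; use DATE(createDt) if it is a TIMESTAMP):

    CREATE OR REPLACE VIEW myViewRecent AS
    SELECT row.* FROM (
      SELECT ARRAY_AGG(t ORDER BY updatedTime DESC LIMIT 1)[OFFSET(0)] AS row
      FROM `yourTable` AS t
      -- constant filter on the partitioning column, so only recent partitions are scanned
      WHERE createDt > DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
      GROUP BY ID
    )

This trades completeness for cost: IDs whose latest record is older than the window would be missed, which is why Samuel pairs it with a scheduled helper table for the history.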

1 Answer


Turns out that if you apply ARRAY_AGG(col ORDER BY updatedTime DESC LIMIT 1)[OFFSET(0)] to each column individually and select them to define the view, partition pruning works! I am also lucky that the partitioning column createDt won't change once an ID is created, so I included it in the GROUP BY. Below is the solution:

CREATE OR REPLACE VIEW myView AS
SELECT ID, createDT,
  ARRAY_AGG(col1 ORDER BY updatedTime DESC LIMIT 1)[OFFSET(0)] AS col1,
  ARRAY_AGG(col2 ORDER BY updatedTime DESC LIMIT 1)[OFFSET(0)] AS col2
FROM `yourTable`
GROUP BY ID, createDT

Now if I query my view with createDT as a filter, only the intended partition is read. I have tested it with large data and it works.
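A consumer query for the reporting team would then look like this (the date literal is only an example):

    SELECT *
    FROM myView
    WHERE createDT = '2022-09-18'  -- prunes the scan to a single partition of the base table
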

NOTE: I got curious whether adding the partitioning column createDT to the GROUP BY alone would solve the problem, and tried the same with the SELECT row.* query posted in the question, but it still doesn't work. I guess it is essential to apply ARRAY_AGG to each column. Sounds silly, but that's how it is :)