1

I have a table in the following form:

index, ingestion_id,        a,  b,  c,  d

0,     '2020-04-22-1600',   0a, 0b, 0c, 0d
1,     '2020-04-22-1700',   0a, 0b, 0c, 0d
2,     '2020-04-22-1600',   1a, 1b, 1c, 1d
3,     '2020-04-22-1700',   1a, 1b, 1c, 1d
4,     '2020-04-22-1800',   1a, 1b, 1c, 1d
...

I would like extract all the rows and columns where the ingestion_id is the highest. Thus it should return index 1 and index 4 for all rows and columns.

I found some examples, but they require that we pre-define the columns that we want to select. I don't know the columns in advance, but I do know that the table will have a column named ingestion_id. Here is an example:

SELECT *
    FROM (
        SELECT MAX(ingestion_id) as ingestion_id, a, b, c, d
        FROM table as t
        GROUP BY a, b, c, d
        ORDER BY a
    )

How can I select all columns where the ingestion_id is the highest and group by all columns except for the ingestion_id?


BONUS

Imagine the table now having the form:

index, ingestion_id,        a,  b,  c,  d

0,     '2020-04-22-1600',   0a, 0b, 0c, 0d
1,     '2020-04-22-1700',   0a, 0b, 0c, 0d
2,     '2020-04-22-1600',   1a, 1b, 1c, 1d
3,     '2020-04-22-1700',   1a, 1b, 1c, 1d
4,     '2020-04-26-1800',   2a, 2b, 2c, 2d
5,     '2020-04-26-1900',   2a, 2b, 2c, 2d
...

The answer provided by Gordon Linoff (as of 2020/04/26) will in this case only filter out row 5 as its the highest ingestion_id. We also need however row 1 and row 3 as the values (except for the column ingestion_id) are unique in the other columns.

halfer
  • 19,824
  • 17
  • 99
  • 186
WJA
  • 6,676
  • 16
  • 85
  • 152
  • 1
    What is `0b`? `1d`? I don't understand those values. Are they strings? – Gordon Linoff Apr 22 '20 at 15:19
  • Just strings yes, values – WJA Apr 22 '20 at 15:21
  • Why row 1 and not 3? – Gordon Linoff Apr 22 '20 at 16:12
  • Because the row/column values of a,b,c,d (not ingestion_id) are different across the columns/rows. 0a, 1a. As for row 3, we would take row 4 as the ingestion id is higher for row 4. – WJA Apr 22 '20 at 16:14
  • How can you "not know the columns in advance"? These are very well defined in a table. – Gordon Linoff Apr 22 '20 at 16:16
  • Because I am streaming data into tables coming from different sources. Each source has a different set of columns with different names. I subsequently load all the columns for a selection of rows (latest ingestion_id) and using a maptable I harmonize the column names. The column names are in a csv files, so I could use those, but nevertheless, prefer to keep my options open. – WJA Apr 22 '20 at 16:18

5 Answers5

4

This answers the original version of the question.

I would like extract all the rows and columns where the ingestion_id is the highest.

If I understand correctly, you can use window a functions:

select t.* except (seqnum)
from (select t.*, rank() over (order by ingestion_id desc) as seqnum
      from `t` t
     ) t
where seqnum = 1;

You can select all corresponding rows as:

select t.* except (seqnum, grpid, min_grpid_seqnum)
from (select t.*,
             min(seqnum) over (partition by grpid) as min_grpid_seqnum
      from (select t.*, rank() over (order by ingestion_id desc) as seqnum,
                   dense_rank() over (partition by a, b, c, d) as grpid
            from `t` t
           ) t
     ) t
where min_grpid_seqnum = 1;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Elegant solution it seems. Is it also an efficient solution? I am not so versed in SQL so I really don't know what causes to consume more memory. Like, I am just saying, if we would do with a different method instead (eg Grouping) would that be more efficient. This looks however efficient to me as you only seem to "compute" the rank which is a built in method I suppose. – WJA Apr 22 '20 at 16:27
  • Ah one thing what is missing. This requires that we only select one single ingestion_id. However, we would need the highest per combination of the other rows. – WJA Apr 26 '20 at 12:48
  • @JohnAndrews . . . I can't think of a more efficient solution. Perhaps you should ask a new question regarding the second comment -- I'm am not sure how multiple ingestion ids would apply to what you describe here. – Gordon Linoff Apr 26 '20 at 13:08
  • Your answer works perfectly fine in the example I gave before. However, a new use case came up. I have updated my question and started a bounty for this new example. Perhaps a combination of your answer can be used. – WJA Apr 26 '20 at 13:25
  • @JohnAndrews . . . I generally don't answer updated questions (it just leads to an unending cycle of revisions). That is why I suggested that you ask a *new* question. – Gordon Linoff Apr 26 '20 at 16:53
2

How can I select all columns where the ingestion_id is the highest and group by all columns except for the ingestion_id?
Each source has a different set of columns with different names

Below is for BigQuery Standard SQL and has no dependency on the naming for the rest of columns at all

#standardSQL
SELECT ARRAY_AGG(t ORDER BY ingestion_id DESC LIMIT 1)[OFFSET(0)].*  
FROM `project.dataset.table` t
GROUP BY TO_JSON_STRING((SELECT AS STRUCT * EXCEPT(ingestion_id) FROM UNNEST([t])))

If to apply to sample data from your question as in below example

#standardSQL
WITH `project.dataset.table` AS (
  SELECT '2020-04-22-1600' ingestion_id, '0a' a, '0b' b, '0c'c, '0d' d UNION ALL
  SELECT '2020-04-22-1700', '0a', '0b', '0c', '0d' UNION ALL
  SELECT '2020-04-22-1600', '1a', '1b', '1c', '1d' UNION ALL
  SELECT '2020-04-22-1700', '1a', '1b', '1c', '1d' UNION ALL
  SELECT '2020-04-22-1800', '1a', '1b', '1c', '1d' 
)
SELECT ARRAY_AGG(t ORDER BY ingestion_id DESC LIMIT 1)[OFFSET(0)].*  
FROM `project.dataset.table` t
GROUP BY TO_JSON_STRING((SELECT AS STRUCT * EXCEPT(ingestion_id) FROM UNNEST([t])))

output is

Row ingestion_id    a   b   c   d    
1   2020-04-22-1700 0a  0b  0c  0d   
2   2020-04-22-1800 1a  1b  1c  1d   
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
  • Why the to json string? – WJA Apr 22 '20 at 18:05
  • that allows you to achieve what you wanted - specifically - to extract all unknown columns and then group by them - so this solution actually works and fully answers your question - while other answer clearly does not :o) – Mikhail Berlyant Apr 22 '20 at 19:04
  • Sorry did not work for me as a new use case came up. I have updated my question and started a bounty. – WJA Apr 26 '20 at 13:25
  • not sure what you mean by "did not worked" - I just tested and it works for both original and updated question. anyway adding new answer for "updated" use-case – Mikhail Berlyant Apr 26 '20 at 15:11
  • can you clarify what exactly you mean when you said it does not work for you? i feel like you just missing something simple as it definitely works for me (unless I am missing something simple and obvious) – Mikhail Berlyant Apr 26 '20 at 17:54
  • Basically it returns all the rows. But I do have a lot of columns, that might explain a part of it. – WJA Apr 26 '20 at 18:22
  • 1
    that is not true really. this just mean you didn't get the answer properly and/or not trying to understand how it works. obviously answer is based on what you presented in your question - and if in reality your data is different for what in question - you need to apply some extra efforts ... – Mikhail Berlyant Apr 26 '20 at 18:25
  • 1
    Agreed. I will retry tomorrow and to debug where it comes from. I suspect that I have some other columns where the rows are non identical. – WJA Apr 26 '20 at 18:26
2

Below is for BigQuery Standard SQL

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 0 index, '2020-04-22-1600' ingestion_id, '0a' a, '0b' b, '0c'c, '0d' d UNION ALL
  SELECT 1, '2020-04-22-1700', '0a', '0b', '0c', '0d' UNION ALL
  SELECT 2, '2020-04-22-1600', '1a', '1b', '1c', '1d' UNION ALL
  SELECT 3, '2020-04-22-1700', '1a', '1b', '1c', '1d' UNION ALL
  SELECT 4, '2020-04-26-1800', '2a', '2b', '2c', '2d' UNION ALL
  SELECT 5, '2020-04-26-1900', '2a', '2b', '2c', '2d' 
)
SELECT ARRAY_AGG(t ORDER BY ingestion_id DESC LIMIT 1)[OFFSET(0)].*  
FROM `project.dataset.table` t
GROUP BY TO_JSON_STRING((SELECT AS STRUCT * EXCEPT(index, ingestion_id) FROM UNNEST([t])))

with output

Row index   ingestion_id        a       b       c       d    
1   1       2020-04-22-1700     0a      0b      0c      0d   
2   3       2020-04-22-1700     1a      1b      1c      1d   
3   5       2020-04-26-1900     2a      2b      2c      2d   
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
1

You've asked for "all the rows with the highest ingestion_id. According to your sample-data, you only have one value row with a highest value for ingestion_id

So, in order to present your data with the highest value you can use MAX() within a subquery and simply use SELECT * because you don't know all the columns that may exist, this would look something like this, in it's simplest format;

SELECT * FROM table
WHERE IngestionID = (SELECT MAX(IngestionID) FROM table);

Bonus Answer

    DECLARE @columns NVARCHAR(MAX)
    DECLARE @result NVARCHAR(MAX)

 SELECT @columns = STUFF(
                        (

SELECT ',' + z.COLUMN_NAME FROM information_schema.columns z WHERE z.table_name = 'datatable'
AND z.COLUMN_NAME NOT IN ('Index_ID','Ingestion_ID') 
FOR xml path('')
)
                        , 1
                        , 1
                        , '')

SET @result = 'SELECT MAX(Ingestion_ID) [Ingestion ID],' + (SELECT @columns) + ' FROM datatable GROUP BY ' + (SELECT @columns);

EXEC(@result)

Note: I've changed the table name to datatable to avoid SQL reserved keywords (same for index -> Index_ID)

Outputs

Ingestion ID    a   b   c   d
2020-04-22-1700 0a  0b  0c  0d
2020-04-22-1700 1a  1b  1c  1d
2020-04-26-1900 2a  2b  2c  2d

I suggest not including the index because this is always unique and will just cause it to return every row, but looking at your questions and your original script, you aren't looking to include it so I believe this script will do exactly what you need.

Tested against the following;

Column Name     DataType
Index_ID        int
Ingestion_ID    varchar(15)
a               varchar(2)
b               varchar(2)
c               varchar(2)
d               varchar(2)
Simon
  • 496
  • 4
  • 19
1

This can be done in standard SQL as follows.

I am assuming your data to reside in a temp table.

WITH temp AS ( SELECT 0 index, '2020-04-22-1600' ingestion_id, '0a' a, '0b' b, '0c'c, '0d' d UNION ALL SELECT 1, '2020-04-22-1700', '0a', '0b', '0c', '0d' UNION ALL SELECT 2, '2020-04-22-1600', '1a', '1b', '1c', '1d' UNION ALL SELECT 3, '2020-04-22-1700', '1a', '1b', '1c', '1d' UNION ALL SELECT 4, '2020-04-26-1800', '2a', '2b', '2c', '2d' UNION ALL SELECT 5, '2020-04-26-1900', '2a', '2b', '2c', '2d' )

select index,ingestion_id,a,b,c,d from (select index,ingestion_id,a,b,c,d,row_number() over(partition by a,b,c,d order ingestion_id desc) top from temp ) where top = 1

It will produce the following output:

index ingestion_id a b c d
1 2020-04-22-1700 0a 0b 0c 0d
3 2020-04-22-1700 1a 1b 1c 1d
5 2020-04-26-1900 2a 2b 2c 2d