Get distinct information across many fields some of which are NULL

Question

I have a table with just over 65 million rows and 140 columns. The data comes from several sources and is submitted at least every month.

I look for a quick way to grab specific fields from this data only where they are unique. Thing is, I want to process all the information to link which invoice was sent with which identifying numbers and it was sent by whom. Issue is, I don't want to iterate over 65 million records. If I can get distinct values, then I will only have to process say 5 million records as opposed to 65 million. See below for a description of the data and SQL Fiddle for a sample

If say a client submits an invoice_number linked to passport_number_1, national_identity_number_1 and driving_license_1 every month, I only want one row where this appears. i.e. the 4 fields have got to be unique

If they submit the above for 30 months then on the 31st month they send the invoice_number linked to passport_number_1, national_identity_number_2 and driving_license_1, I want to pick this row also since the national_identity field is new hence the whole row is unique

By linked to I mean they appear on the same row
For all fields its possible to have Null occurring at one point.
The 'pivot/composite' columns are the invoice_number and submitted_by. If any of those aren't there, drop that row
I also need to include the database_id with the above data. i.e. the primary_id which is auto generated by the postgresql database
The only fields that don't need to be returned are the other_column and yet_another_column. Remember the table has 140 columns so don't need them
With the results, create a new table that will hold this unique records

See this SQL fiddle for an attempt to recreate the scenario.

From that fiddle, I'd expect a result like:

Row 1, 2 & Row 11: Only one of them shall be kept as they are exactly the same. Preferably the row with the smallest id.
Row 4 and Row 9: One of them would be dropped as they are exactly the same.
Row 5, 7, & 8: Would be dropped since they are missing either the invoice_number or submitted_by.
The result would then have Row (1, 2 or 11), 3, (4 or 9), 6 and 10.

I don't quite understand what the expected result is. You say you only want one row from a set that has four fields the same but want to get other fields as well. If you don't care which row you get then why does the `id` of that row even matter? — binoternary, Jul 16 '16 at 16:12
And `One of them would be dropped` should be `Only one of them would be kept`. Does it matter which one? — Erwin Brandstetter, Jul 17 '16 at 03:27
@ErwinBrandstetter Thanks for the correction. To answer you, no, doesn't matter which one to be kept. Preferably though, I'd like to keep the one that appears first. — lukik, Jul 17 '16 at 06:54
"Appearing first" is not defined in a database table. You mean the row with the smallest `ìd`? — Erwin Brandstetter, Jul 17 '16 at 12:49

binoternary · Accepted Answer · 2016-07-16T17:11:14.603

2

To get one representative row (with additional fields) from a group with the four distinct fields:

SELECT 
distinct on (
  invoice_number
  , passport_number
  , national_id_number
  , driving_license_number
)
  * -- specify the columns you want here
FROM my_table
where invoice_number is not null
and submitted_by is not null
;

Note that it is unpredictable which row exactly is returned unless you specify an ordering (documentation on distinct)

Edit:

To order this result by id simply adding order by id to the end doesn't work, but it can be done by eiter using a CTE

with distinct_rows as (
    SELECT 
    distinct on (
      invoice_number
      , passport_number
      , national_id_number
      , driving_license_number
      -- ...
    )
      * -- specify the columns you want here
    FROM my_table
    where invoice_number is not null
    and submitted_by is not null
)
select *
from distinct_rows
order by id;

or making the original query a subquery

select *
from (
    SELECT 
    distinct on (
      invoice_number
      , passport_number
      , national_id_number
      , driving_license_number
      -- ...
    )
      * -- specify the columns you want here
    FROM my_table
    where invoice_number is not null
    and submitted_by is not null
) t
order by id;

edited Jul 16 '16 at 17:11

answered Jul 16 '16 at 16:32

binoternary

1,865
1
13
25

The second option is what am looking for. See [this result](http://sqlfiddle.com/#!15/a4568/38). How can I execute this query and order by id? – lukik Jul 16 '16 at 16:53
you can't obtain distinct with id ...in this way beacuse id is different for each row – ScaisEdge Jul 16 '16 at 17:10
my comment was asking for order by. not distinct. let me check your edits.. – lukik Jul 16 '16 at 17:15
The `subquery` option looks like a correct result. For my own curiosity, given that I'll run this against 65 million records, which of your two methods do you think is faster? – lukik Jul 16 '16 at 17:21
@lukik, My expectation is that they are about the same in terms of performance, but I could be wrong. I suggest looking at the [explain plan](https://www.postgresql.org/docs/current/static/sql-explain.html) of both queries to see if one of them has a significantly higher cost. But of course actually running both queries on actual data will give the best answer. – binoternary Jul 16 '16 at 17:27
@lukik: Subqueries are faster. CTEs add materialization overhead. Only use CTEs when you actually need them. The right amount of `work_mem` is crucial for performance [About `DISTINCT ON` and `work_mem`](http://stackoverflow.com/questions/3800551/select-first-row-in-each-group-by-group/7630564#7630564). BTW, actually running both queries can still be misleading since the second run profits from prewarmed cache. – Erwin Brandstetter Jul 17 '16 at 03:41
@ErwinBrandstetter, guys, is there a way to get the values of the `id` column for the `duplicated` data? Reason is that once I have the distinct rows and I've processed them, I want to later update the original table with a value that has now been attached to each unique field. See [this question I've posted as a follow up](http://stackoverflow.com/q/38500846/1082673). Thanks. – lukik Jul 21 '16 at 09:59

score 0 · Answer 2 · answered Jul 16 '16 at 20:44

quick way to grab specific fields from this data only where they are unique

I don't think so. I think you mean you want to select a distinct set of rows from a table in which they are not unique.

As far as I can tell from your description, you simply want

SELECT distinct invoice_number, passport_number, 
                driving_license_number, national_id_number
FROM my_table
where invoice_number is not null
and submitted_by is not null;

In your SQLFiddle example, that produces 5 rows.

Get distinct information across many fields some of which are NULL

2 Answers2

Linked