PostgreSQL: Count Number of Occurrences in Columns

Question

BACKGROUND

I have three large tables (employee_info, driver_info, school_info) that I have joined together on common attributes using a series of LEFT OUTER JOIN operations. After each join, the resulting number of records increased slightly, indicating that there are duplicate IDs in the data. To try and find all of the duplicates in the IDs, I dumped the ID columns into a temp table like so:

Original Dump of ID Columns

first_name	last_name	employee_id	driver_id	school_id
Mickey	Mouse	1234	abcd	wxyz
Donald	Duck	2423	heca	qwer
Mary	Poppins	1111	acbe	aaaa
Wiley	Cayote	1234	strf	aaaa
Daffy	Duck	1256	acbe	pqrs
Bugs	Bunny	9999	strf	yxwv
Pink	Panther	2222	zzzz	zzaa
Michael	Archangel	0000	rstu	aaaa

In this overly simplified example, you will see that IDs 1234 (employee_id), strf (driver_id), and aaaa (school_id) are each duplicated at least once. I would like to add a count column for each of the ID columns, and populate them with the count for each ID used, like so:

ID Columns with Counts

first_name	last_name	employee_id	employee_id_count	driver_id	driver_id_count	school_id	school_id_count
Mickey	Mouse	1234	2	abcd	1	wxyz	1
Donald	Duck	2423	1	heca	1	qwer	1
Mary	Poppins	1111	1	acbe	1	aaaa	3
Wiley	Cayote	1234	2	strf	2	aaaa	3
Daffy	Duck	1256	1	acbe	1	pqrs	1
Bugs	Bunny	9999	1	strf	2	yxwv	1
Pink	Panther	2222	1	zzzz	1	zzaa	1
Michael	Archangel	0000	1	rstu	1	aaaa	3

You can see that IDs 1234 and strf each have 2 in the count, and aaaa has 3. After generating this table, my goal is to pull out all records where any of the counts are greater than 1, like so:

All Records with One or More Duplicate IDs

first_name	last_name	employee_id	employee_id_count	driver_id	driver_id_count	school_id	school_id_count
Mickey	Mouse	1234	2	abcd	1	wxyz	1
Mary	Poppins	1111	1	acbe	1	aaaa	3
Wiley	Cayote	1234	2	strf	2	aaaa	3
Bugs	Bunny	9999	1	strf	2	yxwv	1
Michael	Archangel	0000	1	rstu	1	aaaa	3

Real World Perspective

In my real-world work, the JOIN'd table contains 100 columns, 15 different ID fields and over 30,000 records, and the final table came out to be 28 more than the original. This may seem like a small amount, but each of the 28 represent a broken link that we must fix.

Is there a simple way to get the counts populated like in the second table above? I have been wrestling with this for hours already, and have not been able to make this work. I tried some aggregate functions, but they cannot be used in table UPDATE operations.

Tim Biegeleisen · Accepted Answer · 2021-03-01T03:54:03.690

0

The COUNT function, when used as an analytic function, can do what you want here, e.g.

WITH cte AS (
    SELECT *,
           COUNT(employee_id) OVER (PARTITION BY employee_id) employee_id_count,
           COUNT(driver_id)   OVER (PARTITION BY driver_id)   driver_id_count,
           COUNT(school_id)   OVER (PARTITION BY school_id)   school_id_count
    FROM yourTable
)

SELECT *
FROM cte
WHERE
    employee_id_count > 1
    driver_id_count > 1
    school_id_count > 1;

edited Mar 01 '21 at 03:54

answered Mar 01 '21 at 02:17

Tim Biegeleisen

502,043
27
286
360

How can I get this to not match up on NULLs? It is counting NULL values, which is what I do not want/need. – Michael Sheaver Mar 01 '21 at 03:47
@MichaelSheaver Just count the columns themselves (I think this is what you want here). – Tim Biegeleisen Mar 01 '21 at 03:54
@Michael Sheaver add a case when statement for your nulls. Case when the id is null, make the column show 0 else count. This could solve your problem. – T.kowshik Yedida Mar 01 '21 at 03:57
@T.kowshikYedida No, that's not necessary, because the `COUNT` function by default already will not count a `NULL` value. The problem with `COUNT(*)` is that this doesn't apply, because it will always count one for every record. – Tim Biegeleisen Mar 01 '21 at 03:58
What could be the problem asked by OP then, "How can I get this to not match up on NULLs? It is counting NULL values, which is what I do not want/need." I didn't quite get it right, then. – T.kowshik Yedida Mar 01 '21 at 04:03