BACKGROUND
I have three large tables (employee_info, driver_info, school_info) that I have joined on common attributes using a series of LEFT OUTER JOIN operations. After each join, the number of records in the result increased slightly, indicating that some IDs are duplicated in the data. To find all of the duplicate IDs, I dumped the ID columns into a temp table like so:
Original Dump of ID Columns
first_name | last_name | employee_id | driver_id | school_id |
---|---|---|---|---|
Mickey | Mouse | 1234 | abcd | wxyz |
Donald | Duck | 2423 | heca | qwer |
Mary | Poppins | 1111 | acbe | aaaa |
Wiley | Cayote | 1234 | strf | aaaa |
Daffy | Duck | 1256 | acbe | pqrs |
Bugs | Bunny | 9999 | strf | yxwv |
Pink | Panther | 2222 | zzzz | zzaa |
Michael | Archangel | 0000 | rstu | aaaa |
In this overly simplified example, you will see that IDs 1234 (employee_id), strf (driver_id), and aaaa (school_id) are each duplicated at least once. I would like to add a count column for each of the ID columns, and populate them with the count for each ID used, like so:
ID Columns with Counts
first_name | last_name | employee_id | employee_id_count | driver_id | driver_id_count | school_id | school_id_count |
---|---|---|---|---|---|---|---|
Mickey | Mouse | 1234 | 2 | abcd | 1 | wxyz | 1 |
Donald | Duck | 2423 | 1 | heca | 1 | qwer | 1 |
Mary | Poppins | 1111 | 1 | acbe | 1 | aaaa | 3 |
Wiley | Cayote | 1234 | 2 | strf | 2 | aaaa | 3 |
Daffy | Duck | 1256 | 1 | acbe | 1 | pqrs | 1 |
Bugs | Bunny | 9999 | 1 | strf | 2 | yxwv | 1 |
Pink | Panther | 2222 | 1 | zzzz | 1 | zzaa | 1 |
Michael | Archangel | 0000 | 1 | rstu | 1 | aaaa | 3 |
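One way the count columns could be produced without any UPDATE is a window function, COUNT(*) OVER (PARTITION BY ...), which attaches a per-ID count to every row in a plain SELECT. The sketch below is only an illustration under assumptions: the question does not name the database engine, so it uses Python's built-in sqlite3 module as a stand-in (window functions need SQLite 3.25 or newer), and the table name id_dump is hypothetical.

```python
import sqlite3

# Sample rows copied from the original dump of ID columns.
ROWS = [
    ("Mickey", "Mouse", "1234", "abcd", "wxyz"),
    ("Donald", "Duck", "2423", "heca", "qwer"),
    ("Mary", "Poppins", "1111", "acbe", "aaaa"),
    ("Wiley", "Cayote", "1234", "strf", "aaaa"),
    ("Daffy", "Duck", "1256", "acbe", "pqrs"),
    ("Bugs", "Bunny", "9999", "strf", "yxwv"),
    ("Pink", "Panther", "2222", "zzzz", "zzaa"),
    ("Michael", "Archangel", "0000", "rstu", "aaaa"),
]

# id_dump is a hypothetical name standing in for the temp table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE id_dump (first_name TEXT, last_name TEXT,"
    " employee_id TEXT, driver_id TEXT, school_id TEXT)"
)
conn.executemany("INSERT INTO id_dump VALUES (?, ?, ?, ?, ?)", ROWS)

# Each window counts the rows that share the same value in one ID column.
COUNT_QUERY = """
SELECT first_name, last_name,
       employee_id, COUNT(*) OVER (PARTITION BY employee_id) AS employee_id_count,
       driver_id,   COUNT(*) OVER (PARTITION BY driver_id)   AS driver_id_count,
       school_id,   COUNT(*) OVER (PARTITION BY school_id)   AS school_id_count
FROM id_dump
"""
counted = conn.execute(COUNT_QUERY).fetchall()
for row in counted:
    print(row)
```

Run against the sample rows exactly as posted, this also reports a count of 2 for driver_id acbe, because that value appears on both the Mary Poppins and the Daffy Duck rows.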
You can see that IDs 1234 and strf each have a count of 2, and aaaa has a count of 3. After generating this table, my goal is to pull out all records where any of the counts is greater than 1, like so:
All Records with One or More Duplicate IDs
first_name | last_name | employee_id | employee_id_count | driver_id | driver_id_count | school_id | school_id_count |
---|---|---|---|---|---|---|---|
Mickey | Mouse | 1234 | 2 | abcd | 1 | wxyz | 1 |
Mary | Poppins | 1111 | 1 | acbe | 1 | aaaa | 3 |
Wiley | Cayote | 1234 | 2 | strf | 2 | aaaa | 3 |
Bugs | Bunny | 9999 | 1 | strf | 2 | yxwv | 1 |
Michael | Archangel | 0000 | 1 | rstu | 1 | aaaa | 3 |
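The filtering step could be sketched as follows. Window functions are generally not allowed directly in a WHERE clause, so the counts are computed in a common table expression first and filtered in the outer query. As before, this is an assumption-laden illustration: SQLite (3.25 or newer) through Python's sqlite3 module stands in for the unnamed database, the table name id_dump is hypothetical, and only a few of the sample rows are loaded to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE id_dump (first_name TEXT, last_name TEXT,"
    " employee_id TEXT, driver_id TEXT, school_id TEXT)"
)
# A subset of the sample rows; id_dump is a hypothetical table name.
conn.executemany(
    "INSERT INTO id_dump VALUES (?, ?, ?, ?, ?)",
    [
        ("Mickey", "Mouse", "1234", "abcd", "wxyz"),
        ("Donald", "Duck", "2423", "heca", "qwer"),
        ("Wiley", "Cayote", "1234", "strf", "aaaa"),
        ("Bugs", "Bunny", "9999", "strf", "yxwv"),
        ("Pink", "Panther", "2222", "zzzz", "zzaa"),
    ],
)

# Compute the counts in a CTE, then keep rows where any count exceeds 1.
DUPLICATES_QUERY = """
WITH counted AS (
    SELECT first_name, last_name,
           employee_id, COUNT(*) OVER (PARTITION BY employee_id) AS employee_id_count,
           driver_id,   COUNT(*) OVER (PARTITION BY driver_id)   AS driver_id_count,
           school_id,   COUNT(*) OVER (PARTITION BY school_id)   AS school_id_count
    FROM id_dump
)
SELECT *
FROM counted
WHERE employee_id_count > 1
   OR driver_id_count > 1
   OR school_id_count > 1
ORDER BY last_name, first_name
"""
duplicates = conn.execute(DUPLICATES_QUERY).fetchall()
for row in duplicates:
    print(row)
```

With this subset, only the Bugs Bunny, Wiley Cayote, and Mickey Mouse rows come back; Donald Duck and Pink Panther share no ID with another row and are dropped.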
Real World Perspective
In my real-world work, the joined table contains 100 columns, 15 different ID fields, and over 30,000 records, and the final table came out 28 records larger than the original. This may seem like a small number, but each of the 28 represents a broken link that we must fix.
Is there a simple way to get the counts populated like in the second table above? I have been wrestling with this for hours already and have not been able to make it work. I tried some aggregate functions, but they cannot be used directly in table UPDATE operations.
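If the counts really do need to be stored in the table, one possible workaround is to move the aggregate into a correlated subquery on the right-hand side of the SET clause, and to generate one such UPDATE per ID column so that 15 ID fields do not have to be written out by hand. This is a sketch under assumptions, not a definitive answer: SQLite through Python's sqlite3 module stands in for the unnamed database, the names id_dump and ID_COLUMNS are hypothetical, and some engines (MySQL, for example) reject a subquery that reads from the table being updated, so the statement may need adapting.

```python
import sqlite3

# Hypothetical list of ID columns; in the real table it would hold all 15.
ID_COLUMNS = ["employee_id", "driver_id", "school_id"]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE id_dump (first_name TEXT,"
    " employee_id TEXT, driver_id TEXT, school_id TEXT)"
)
# A few of the sample rows, trimmed to the columns needed here.
conn.executemany(
    "INSERT INTO id_dump VALUES (?, ?, ?, ?)",
    [
        ("Mickey", "1234", "abcd", "wxyz"),
        ("Wiley", "1234", "strf", "aaaa"),
        ("Bugs", "9999", "strf", "yxwv"),
        ("Michael", "0000", "rstu", "aaaa"),
    ],
)

# Column names come from the hard-coded list above, never from user input,
# so building the statements with f-strings is acceptable in this sketch.
for col in ID_COLUMNS:
    conn.execute(f"ALTER TABLE id_dump ADD COLUMN {col}_count INTEGER")
    # The subquery counts rows sharing this row's ID value.
    # A NULL ID matches nothing under '=', so such rows would get a count of 0.
    conn.execute(
        f"UPDATE id_dump SET {col}_count = ("
        f"SELECT COUNT(*) FROM id_dump AS d WHERE d.{col} = id_dump.{col})"
    )

updated = conn.execute(
    "SELECT first_name, employee_id_count, driver_id_count, school_id_count"
    " FROM id_dump ORDER BY first_name"
).fetchall()
for row in updated:
    print(row)
```

On a 30,000-row table the correlated subquery runs once per row, so an index on each ID column, or the window-function form in a SELECT, is likely to be faster.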