1

I have two tables of service providers, providers and providers_clean. providers contains many thousands of providers with very poorly formatted data, providers_clean only has a few providers which still exist in the 'dirty' table as well.

I want the system using this data to remain functional while the user is 'cleaning' the data up, so I'd like to be able to select all of the rows that have already been 'cleaned' and the rows that are still 'dirty' while excluding any 'dirty' results that have the same id as the 'clean' ones.

How can I select all of the providers from the providers_clean table merged with all of the providers from the providers table, and EXCLUDE the ones that have already been 'cleaned'

I've tried:

SELECT * FROM providers WHERE NOT EXISTS (SELECT * FROM providers_clean WHERE providers_clean.id = providers.id)

which gives me all of the 'dirty' results from providers EXCLUDING the 'clean' ones, but how can I rewrite the query to now merge all of the 'clean' ones from providers_clean?

Here's a visual representation of what I'm trying to do:

Clean Table

+----+-------------------+
| ID |       Name        |
+----+-------------------+
|  1 | Clean Provider 1  |
|  4 | Clean Provider 4  |
|  5 | Clean Provider 5  |
+----+-------------------+


Dirty Table
+----+------------------+
| ID |       Name       |
+----+------------------+
|  1 | Dirty Provider 1 |
|  2 | Dirty Provider 2 |
|  3 | Dirty Provider 3 |
|  4 | Dirty Provider 4 |
|  5 | Dirty Provider 5 |
+----+------------------+


Desired Result

+----+------------------+
| ID |       Name       |
+----+------------------+
|  1 | Clean Provider 1 |
|  2 | Dirty Provider 2 |
|  3 | Dirty Provider 3 |
|  4 | Clean Provider 4 |
|  5 | Clean Provider 5 |
+----+------------------+

Thanks

UPDATE

This is working, however, Is there a more efficient way to write this query?

SELECT providers.id AS id, 
CASE 
  WHEN 
   providers_clean.id IS NOT NULL 
  THEN 
   providers_clean.provider_name
  ELSE 
   providers.provider_name
END AS pname,

CASE 
  WHEN 
   providers_clean.id IS NOT NULL 
  THEN 
   providers_clean.phone
  ELSE 
   providers.phone
END AS pphone,

CASE 
  WHEN 
   providers_clean.id IS NOT NULL 
  THEN 
   providers_clean.website
  ELSE 
   providers.website
END AS pwebsite

FROM providers
  LEFT JOIN providers_clean ON providers_clean.id = providers.id
ORDER BY providers.id asc
WheatBeak
  • 1,036
  • 6
  • 12
  • This is what my query essentially does. `COALESCE` offers a more succinct syntax in place of `CASE` expressions. I doubt it can get more efficient than that: its just a simple `LEFT JOIN` operation. – Giorgos Betsos Mar 01 '16 at 21:10
  • But I don't want to use `COALESCE` for any field other than ID. In other words, even if the name, phone, website is `NULL` on the 'clean' table, I want the result to be `NULL`, not the value on the 'dirty' table. That's where I'm having the issue. – WheatBeak Mar 01 '16 at 21:15
  • Just to explain why this matters. Say the provider has no website, I want the value in the 'clean' table to be `NULL` however, on the 'dirty' table, the user might have entered the string `'none'` – WheatBeak Mar 01 '16 at 21:17

3 Answers3

1

Seems like a LEFT JOIN is what you need:

SELECT COALESCE(pc.ID, p.ID), COALESCE(pc.Name, p.Name)
FROM providers AS p
LEFT JOIN providers_clean AS pc ON p.ID = pc.ID

What this query essentially does: if the record exists in the 'clean' table then select this one, otherwise select the one from the 'dirty' table.

Giorgos Betsos
  • 71,379
  • 9
  • 63
  • 98
  • This is partially working, however, if a column on the 'clean' table is NULL, the result will use the value from the 'dirty' table. I'm guessing this has something to do with COALESCE but I don't really understand too well how that works. – WheatBeak Mar 01 '16 at 20:41
  • @WheatBeak I thought this is the intended result: *gives me all of the 'dirty' results from providers EXCLUDING the 'clean' ones* – Giorgos Betsos Mar 01 '16 at 21:04
1

I love and often refer to this and this visual explanation of how JOINs work.

According to them you need a FULL OUTER JOIN excluding the items that are in both tables ("Outer Excluding JOIN"):

SELECT *
  FROM providers p
  FULL OUTER JOIN providers_clean pc
       ON pc.id = p.id
 WHERE p.id IS NULL OR pc.id IS NULL;

Update Unfortunately there's no FULL OUTER JOIN in MySQL so you have to emulate it. I used this answer to do that:

select p.*
  from providers p left join providers_clean pc on pc.id = p.id
  where pc.id is null
union all
select pc.* 
  from providers p right join providers_clean pc on pc.id = p.id;

The first SELECT are the dirty ones that have no clean counterpart and the second SELECT are simply the clean ones.

Community
  • 1
  • 1
PerlDuck
  • 5,610
  • 3
  • 20
  • 39
1

You need to do an outer join from Dirty to Clean (since Dirty has all rows Clean has, but not vice versa)

SELECT dirty.id AS id, 
CASE 
  WHEN clean.id IS NOT NULL THEN clean.name
  ELSE dirty.name
END AS new_name
FROM dirty
  LEFT JOIN clean ON clean.id = dirty.id
ORDER BY dirty.id asc

Example

Lloyd Banks
  • 35,740
  • 58
  • 156
  • 248
  • I think this is on the right track but how would I get all of the columns on each row, not just the name? – WheatBeak Mar 01 '16 at 20:49
  • @WheatBeak Are you trying to get all of the columns from both tables on each row? Just replace `dirty.id...END AS new_name` after SELECT with `*` – Lloyd Banks Mar 01 '16 at 20:51
  • not quite, that returns one big row with all of the info. What I'm trying to get is pretty much what your answer gave me, however also including other columns like 'phone', 'website', which exist in both tables. I guess maybe there has to be a `CASE` for each column? – WheatBeak Mar 01 '16 at 20:58
  • @WheatBeak If my assumption is correct, then all of your info (phone, website, etc..) exist on both tables and are of the same value on both tables. And since all records in Clean are also found in Dirty, then you'll only ever have to pull these extra columns from the Dirty table. – Lloyd Banks Mar 01 '16 at 21:04
  • The data might be different from 'dirty' to 'clean', so if the ID exists at all in the 'clean' table then I want all the columns from the 'clean' table for that given ID and none of the data from the 'dirty' table. I've updated my question using a query made from your answer that gives me my desired result, however, I was wondering if there was a better way to write that query?? – WheatBeak Mar 01 '16 at 21:08
  • @WheatBeak No, not with current MySQL syntax – Lloyd Banks Mar 01 '16 at 22:12