0

I was wondering if any of you could help me to improve this query

SELECT IF(cases.country_region LIKE '%Korea%', 'South Korea', IF(upper(cases.country_region) = 'IRAN (ISLAMIC REPUBLIC OF)', 'Iran',  
        IF(upper(cases.country_region) = 'REPUBLIC OF IRELAND', 'IRELAND', 
        IF(cases.country_region = 'United Kingdom', 'UK', IF(upper(cases.country_region) = 'REPUBLIC OF MOLDOVA', 'MOLDOVA', cases.country_region))))) as country, (SUM(cases.latitude)/COUNT(cases.latitude)) as latitude, 
        (SUM(cases.longitude)/COUNT(cases.longitude)) as longitude, SUM(case when cases.confirmed is null then 0 else cases.confirmed end) as total_confirmed, 
        SUM(case when cases.deaths is null then 0 else cases.deaths end) as total_deaths, SUM(case when cases.recovered is null then 0 else cases.recovered end) as total_recovered, 
        SUM(case when cases.active is null then 0 else cases.active end) as total_active_cases, MAX(cases.date) as last_update
        FROM
        `bigquery-public-data.covid19_jhu_csse.summary` cases
        INNER JOIN (
            SELECT c.country_region, MAX(c.date) as maxdate
            FROM    `bigquery-public-data.covid19_jhu_csse.summary` c
            WHERE c.date <= '2020-05-07'
            GROUP BY c.country_region
        ) lcases ON cases.country_region = lcases.country_region AND cases.date = lcases.maxdate
        GROUP BY country
        HAVING total_confirmed > 0
        ORDER BY total_confirmed desc;

I don't really know if there is any way to simplify the first IF(case) part.

If someone has any idea please comment below! Thank you very much!

  • also look at window functions if using MySQL-8.0/MariaDB-10.2+ – danblack May 07 '20 at 04:53
  • "North Korea" becomes "South Korea"? Are you sure shortening country names is the right thing? If so be careful of the patterns. – danblack May 07 '20 at 04:55
  • The query doesn't look correct. In most cases you treat `country_region` as a country, take its last record and then sum up its cases. As you should get one row per country, the sums should just be what the row contains. The aggregation seems unnecessary, if not dangerous in case a day is being reported twice by mistake. On the other hand you seem to fear to find 'REPUBLIC OF IRELAND' and 'Republic Of Ireland'. In that case you take the latest row *for each of these* and sum them up. Shouldn't you just take only the one last row per country? – Thorsten Kettner May 07 '20 at 05:03
  • Seriously consider handling issues of data display in application code – Strawberry May 07 '20 at 05:43
  • @danblack I know, it's just that the db I'm using I've already checked and has no data of North Korea, every data is about South Korea that's why I used that pattern I know this query I made is horrible but it is what it... I just have an external db to get the info I wanted and I'm trying to clean the data – Ariel Kenji Arakaki May 08 '20 at 11:41

4 Answers4

0

You can use CASE as below in the SELECT statement-

SELECT
CASE 
    WHEN cases.country_region LIKE '%Korea%' THEN 'South Korea'
    WHEN upper(cases.country_region) = 'IRAN (ISLAMIC REPUBLIC OF)' THEN 'Iran'
    WHEN upper(cases.country_region) = 'REPUBLIC OF IRELAND' THEN 'IRELAND'
    WHEN cases.country_region = 'United Kingdom' THEN 'UK'
    WHEN upper(cases.country_region) = 'REPUBLIC OF MOLDOVA' THEN 'MOLDOVA'
    ELSE cases.country_region
END AS country
FROM ...

You can also use the same line of codes in the GROUP BY section if required.

mkRabbani
  • 16,295
  • 2
  • 15
  • 24
0

If you want a simplified set of names, put these into their own table and left join on these. I can see your list growing.

Use COALESE(simple.name, cases.country_region) to fall back to the country_region if there isn't a simple name.

danblack
  • 12,130
  • 2
  • 22
  • 41
0

It is sometimes better to make an extra pass to clean up the data. Else GIGO -- Garbage In, Garbage Out.

In a similar list, I found all these problems:

Mainland China VS China
occupied Palestinian territory VS Palestine
Iran (Islamic Republic of) VS Iran
Republic of Korea VS South Korea
Korea, South VS South Korea
Congo (Kinshasa) VS Congo
Congo (Brazzaville) VS Republic of the Congo
Bahamas, The VS The Bahamas
Hong Kong SAR VS Hong Kong
Gambia, The VS Gambia
Taiwan* VS Taiwan
 Azerbaijan VS Azerbaijan

Also, check line endings -- CRLF vs LF. And BOM.

Rick James
  • 135,179
  • 13
  • 127
  • 222
0

You can simplify

case when cases.active is null then 0 else cases.active end

down to

COALESCE(cases.active, 0)
Andy Lester
  • 91,102
  • 13
  • 100
  • 152
  • That could make the query shorter, thank you very much!! :) – Ariel Kenji Arakaki May 08 '20 at 11:43
  • You're welcome. Even more important than shorter, it's less error-prone. What if you had `case when cases.active is null then 0 else cases.death end`? It's valid, but it's wrong, and it might be hard to see when you have a bunch of those stacked together. Any time you can eliminate usage of a variable, that's usually a good thing. – Andy Lester May 08 '20 at 14:02