3

My question might be similar to these questions:

Table Example:

source  zip     phone   street              city        created_at
==================================================================
a       11111   11111   Flatlands           null        2015-01-01
b       11111   11111   Flatlands Avenue    New York    2015-01-01
c       11111   11111   Ave Flatlands       New York    2015-01-01
a       22222   22222   Favory              New York    2015-01-01
b       22222   22222   Favory Avenue       New York    2017-12-12
c       22222   22222   Ave Favory          New York    2015-01-01
b       33333   33333   Sixteenth           Washington  2015-01-01
c       33333   33333   st. 16th            null        2015-01-01
c       44444   44444   st. West Land       null        2015-01-01

Let's say I have a table with information about places in different cities. The information is gathered from 3 different sources: a, b, and c.

zip and phone fields uniquely identify the location, so the rows in the DB can be grouped by these fields.

I need to merge information about various locations from various sources selecting the best value based on the set of rules for street and city columns.

The rules are:

  1. For each group prioritize non empty values of street and city over null values.
  2. In each group prioritize values of street and city columns from a and b sources over c source (weight(a) = weight(b) > weight(c)) if these columns are not empty.
  3. For sources a and b prioritize column values from rows which have latest created_at timestamp.

Here is the result I want to receive:

zip     phone   street          city
====================================
11111   11111   Flatlands       New York
22222   22222   Favory Avenue   New York
33333   33333   Sixteenth       Washington
44444   44444   st. West Land   null

Here is a DB Fiddle to play with.

I'm not sure this is possible to achieve with SQL and maybe my best option is to switch to NoSQL DB + imperative processing task. Or just use some tool to extract information from the DB and then process it.

P.S. This is a simplified example.

Machavity
  • 30,841
  • 27
  • 92
  • 100
sorjef
  • 554
  • 5
  • 15
  • 1
    The first and second criteria can be verified in a `GROUP BY` query but the third one cannot. You **cannot** select rows using `GROUP BY`. `GROUP BY` computes new rows using the data from each group. You **cannot** select rows using `GROUP BY`. `GROUP BY` computes new rows using the data from each group. Take a look at [this answer](https://stackoverflow.com/a/28090544/4265352) on a [similar (but much simpler) question](https://stackoverflow.com/q/12102200/4265352). – axiac Sep 06 '17 at 15:00
  • 1
    I would not write a monster query (with or without `GROUP BY`) to run every time I need to get values from the table. I would treat this as a draft table and I would create another table to store one row for each (`zip`, `phone`) pair with the other fields curated using whatever criteria you need. The scripts that need the data will read from the "clean" table; another script would use the "draft" table to compute the "clean" records. This one can run periodically (and process all the rows) or only when new rows are added to the "draft" (and re-compute only the affected "clean" records). – axiac Sep 06 '17 at 15:06
  • In another DBMS you would simply use `ROW_NUMBER` with an appropriate partition and sort order to rank the records. As MySQL is lacking `ROW_NUMBER`, look up how to emulate `ROW_NUMBER` in MySQL with variables. – Thorsten Kettner Sep 06 '17 at 15:22

1 Answers1

0

You can use the following query to implement the prioritization rules for street:

SELECT zip, phone, street
FROM test 
ORDER BY zip, phone,
        -- prioritize non empty values over null values
         CASE 
            WHEN (street IS NOT NULL) OR (street = '') THEN 0
            ELSE 1
         END,
         -- prioritize a, b over c
         CASE 
            WHEN source IN ('a', 'b') THEN 0
            ELSE 1
         END,
         -- prioritize rows which have the latest created_at
         created_at DESC 

A similar query can be used for city field.

Then you can emulate ROW_NUMBER, unfortunately not available in MySQL, using variables:

SELECT zip, phone, street,
       @seq := IF(@id = CONCAT(zip,phone), @seq + 1,
                  IF(@id := CONCAT(zip,phone), 1, 1)) AS seq
FROM test 
CROSS JOIN (SELECT @seq := 0, @id = '') AS v
ORDER BY zip, phone,
        -- prioritize non empty values over null values
         CASE 
            WHEN (street IS NOT NULL) OR (street = '') THEN 0
            ELSE 1
         END,
         -- prioritize a, b over c
         CASE 
            WHEN source IN ('a', 'b') THEN 0
            ELSE 1
         END,
         -- prioritize rows which have the latest created_at
         created_at DESC 

Again, a similar query can be used for city field.

The desired result is obtained by joining the above derived tables ON zip, street and seq = 1.

Giorgos Betsos
  • 71,379
  • 9
  • 63
  • 98