Performing a combination of splitting, substring searching, and counting/summing in one SQL query

Question

I am working with a MySQL database table that I do not own (i.e., I cannot change the layout/schema). One of the fields/columns has data that looks like this (5 records shown):

12345 - descriptive text
98765 - descriptive text
12345 - descriptive text; 3 x 67890 - descriptive text
12345 - descriptive text; 67890 - descriptive text
12345 - descriptive text; 2 x 67890 - descriptive text

In each record, this field contains one or two distinct 5-digit codes. I need to sum all instances of each specific 5-digit code. As you can see in the example above however, some records have more than one instance of a 5-digit code with the second 5-digit code optionally preceded by a multiplier. That said, based on the example, the 5-digit Code totals would be as follows:

5-digit Code	Count
12345	4
98765	1
67890	6

Since certain 5-digit codes may only appear by themselves in a record (not two different 5-digit codes delimited by a semicolon) and I have a list of these codes, I am able to sum the instances of these codes using the following SQL snippet for each specific code:

SELECT
    COUNT(*),
    SUM(
        CASE
            WHEN code LIKE '%12345%'
        THEN 1
        ELSE 0
    END
) as 'count_12345', ...

The above code works great when there is only one 5-digit code per record. Unfortunately, it gets really ugly when there are two codes in the column delimited by a semicolon. If I was pre/post processing this field in code (C#, PHP, et al), I would simply split the field on the "; " and then parse the second substring to find the optional multiplier. As a relative SQL newbie however, I'm not sure how to go about performing similar string tests/manipulations, accompanied by the appropriate branching logic. Conceptually I understand how to do what I want but syntactically I have yet to find the right combination of SQL commands that yield the desired result when I execute a test query in phpMyAdmin (the environment doesn't provide much, if any, feedback when I enter invalid code).

Does each value part which is obtained by splitting the value by `;` and trimming matches the pattern `^(\d+ x )?\d{5}\D.*$`? — Akina, Aug 30 '23 at 18:59
please read https://stackoverflow.com/questions/3653462/is-storing-a-delimited-list-in-a-database-column-really-that-bad and change your design — nbk, Aug 30 '23 at 19:00
I would parse data in an external app, instead of using SQL for this. — The Impaler, Aug 30 '23 at 19:22
Akina - yes, if the field has two parts (i.e. delimited by "; "), it does match the regular expression pattern ^(\d+ x )?\d{5}\D.*$. — LloydC, Aug 30 '23 at 19:34
nbk - The database I have been asked to work with is a government-generated dataset - I have no control over its content or structure. That said, I have considered building a second database that leverages the content of the first whereby all data gets normalized as part of the export/import process. While I would prefer to work with this normalized version of the original content, I am reluctant to do so only because it adds multiple steps to what will be a regularly recurring process (the "customer" releases new datasets monthly, and sometimes, weekly). — LloydC, Aug 30 '23 at 19:43

score 0 · Accepted Answer · answered Aug 30 '23 at 19:40

You can use a query like this:

SELECT MAX(no),SUM(cnt)
FROM (
  SELECT 
  REGEXP_SUBSTR(val, '^[0-9]{5}') AS no, 1 as cnt
  FROM 5recs
  UNION ALL
  SELECT
  SUBSTRING_INDEX(REGEXP_SUBSTR(val, '; [0-9]{1,2} x [0-9]{5}',1,1), ' ', -1) 
  ,SUBSTRING_INDEX(SUBSTRING_INDEX(REGEXP_SUBSTR(val, '; [0-9]{1,2} x [0-9]{5}',1,1), ' ', 2),' ',-1)
  FROM 5recs
  UNION ALL
  SELECT
  SUBSTRING_INDEX(REGEXP_SUBSTR(val, '; [0-9]{5}',1,1), ' ', -1) 
  , 1
  FROM 5recs
) as recs
WHERE  no is not NULL
GROUP BY no;

sample

mysql> SELECT * FROM 5recs;
+----+--------------------------------------------------------+
| id | val                                                    |
+----+--------------------------------------------------------+
|  1 | 12345 - descriptive text                               |
|  2 | 98765 - descriptive text                               |
|  3 | 12345 - descriptive text; 3 x 67890 - descriptive text |
|  4 | 12345 - descriptive text; 67890 - descriptive text     |
|  5 | 12345 - descriptive text; 2 x 67890 - descriptive text |
+----+--------------------------------------------------------+
5 rows in set (0.00 sec)

mysql> SELECT MAX(no),SUM(cnt)
    -> FROM (
    ->   SELECT 
    ->   REGEXP_SUBSTR(val, '^[0-9]{5}') AS no, 1 as cnt
    ->   FROM 5recs
    ->   UNION ALL
    ->   SELECT
    ->   SUBSTRING_INDEX(REGEXP_SUBSTR(val, '; [0-9]{1,2} x [0-9]{5}',1,1), ' ', -1) 
    ->   ,SUBSTRING_INDEX(SUBSTRING_INDEX(REGEXP_SUBSTR(val, '; [0-9]{1,2} x [0-9]{5}',1,1), ' ', 2),' ',-1)
    ->   FROM 5recs
    ->   UNION ALL
    ->   SELECT
    ->   SUBSTRING_INDEX(REGEXP_SUBSTR(val, '; [0-9]{5}',1,1), ' ', -1) 
    ->   , 1
    ->   FROM 5recs
    -> ) as recs
    -> WHERE  no is not NULL
    -> GROUP BY no;
+---------+----------+
| MAX(no) | SUM(cnt) |
+---------+----------+
| 12345   |        4 |
| 98765   |        1 |
| 67890   |        6 |
+---------+----------+
3 rows in set (0.10 sec)

mysql>

Performing a combination of splitting, substring searching, and counting/summing in one SQL query

1 Answers1