
Background:

First, I have a schema with a table named patients; the columns I care about are patient_id and alerts (alerts is a string of characters, where each character represents some arbitrary value/meaning). Second, each patient belongs to a group [family] based solely on the first 6 digits of their patient_id. Also, a few third parties rely on this database; I did not design this schema/datamodel, I cannot change it, and I cannot migrate away from MySQL.
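
For reference, here is a rough sketch of the relevant part of the table. The column types and sample rows are my own guesses purely for illustration; the actual data model is in the fiddle below:

-- assumed shape; only patient_id and alerts really matter here
CREATE TABLE patients (
    patient_id INT NOT NULL PRIMARY KEY,  -- 7 digits: the first 6 identify the family/group
    name_first VARCHAR(50),
    name_last  VARCHAR(50),
    alerts     VARCHAR(20)                -- one character per arbitrary flag, e.g. 'a!c'
);

-- hypothetical rows: 1234560 and 1234561 belong to the same family (group 123456)
INSERT INTO patients VALUES
    (1234560, 'Jane', 'Doe', 'a!'),
    (1234561, 'John', 'Doe', 'c');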

Here is a fiddle with the data model


Challenge:

Now, I need to find occurrences where patients have alerts containing the !, @, #, %, ^, or & symbols and their family members do not. My first thought was to gather all patients whose alerts contain these symbols, remove the last digit from each patient_id, and then group by this value. That gives me a list of (for all intents and purposes) 'group_ids.' Finally, I need to extend the list to contain each group's family members and their respective alerts strings.


Here is what I have so far:

Query #1:

SELECT p.patient_id, p.name_first, p.name_last, p.alerts
FROM patients p
INNER JOIN (SELECT SUBSTRING(patient_id, 1, CHAR_LENGTH(patient_id) - 1) AS group_id
    FROM patients
    WHERE patient_id BETWEEN 1000000 AND 7999999
    AND (alerts like '%!%'
    OR alerts like '%@%'
    OR alerts like '%#%'
    OR alerts like '%\%%'
    OR alerts like '%^%'
    OR alerts like '%&%')
    GROUP BY group_id) g
ON p.patient_id LIKE CONCAT(g.group_id, '%')
ORDER BY p.patient_id
LIMIT 30000;

Fiddle ~ NOTICE: the fiddle is not an accurate representation of the problem, as the included table only has 28 records.

Recordset: 80,000 ~ Results: 2188 ~ Duration: 14.321 sec ~ Fetch: 0.00 sec ~ Total: 14.321 sec


Query #2:

SELECT p.patient_id, p.name_first, p.name_last, p.alerts
FROM patients p
JOIN (SELECT DISTINCT LEFT(patient_id, 6) AS group_id
    FROM patients
    WHERE patient_id BETWEEN 1000000 AND 7999999
    AND alerts REGEXP '[!@#%^&]') g
ON p.patient_id LIKE CONCAT(g.group_id, '%')
ORDER BY p.patient_id
LIMIT 30000;

Fiddle ~ NOTICE: the fiddle is not an accurate representation of the problem, as the included table only has 28 records.

Recordset: 80,000 ~ Results: 2188 ~ Duration: 4.259 sec ~ Fetch: 5.663 sec ~ Total: 9.992 sec

EDIT: after adding name_first, name_last, alerts and the ORDER BY clause, I found this query took exactly the same amount of time as the first.


Problem:

The list I get back is accurate, but not only will it require extra processing (I plan to do this with PHP), it also takes 14 seconds!

If anyone has a better solution... or can at least point me in the direction of a better, more efficient one, please enlighten me. Thanks in advance.

Extra Credit: any tips on the PHP algo to solve the aforementioned problem with the given data - forget semantics, just a formula will do.

Phillip Weber
  • See https://meta.stackoverflow.com/questions/333952/why-should-i-provide-an-mcve-for-what-seems-to-me-to-be-a-very-simple-sql-query - although I don't think schema design is ever going to lend itself to an efficient solution in MySQL. – Strawberry Sep 15 '17 at 00:14
  • and here I thought I was being thorough :( – Phillip Weber Sep 15 '17 at 00:18
  • You're not the first to make that mistake – Strawberry Sep 15 '17 at 00:18
  • That was the least interesting part of the accepted answer provided (which is why it was confined to parentheses) – Strawberry Sep 15 '17 at 00:21
  • Are you still able to change the datamodel? While using the patient id to group might be something given by outside constraints and thus beyond your control (although you should inform the one who defined it that a patient will not be able to switch groups by e.g. (re-)marrying or a child having a child of their own), it hurts the join to not have it in a separate column. And storing the alerts that way will keep haunting you and isn't indexable either. You store first name and last name in two fields. Apply the same logic to whatever "!" or "%" stands for and give them columns (or a subtable). – Solarflare Sep 15 '17 at 01:05
  • I think you meant solarflare – Strawberry Sep 15 '17 at 06:44
  • Can you copy into another table with proper indexes and datamodel? How many patients are in the table? – The Nail Sep 15 '17 at 06:49
  • @TheNail I thought about creating a new schema and copying over the data and add triggers but I'm not sure the extra overhead is worth it for one query. Currently, the db holds only 30k patients. – Phillip Weber Sep 15 '17 at 07:03
  • @Solarflare I can't change the datamodel (see above) and a patient CAN change groups (sort of) if we change their patient_id. It would be great if I could add a group_id field with a unique index... then patient_id could be completely independent. I am forced to work with this data just the way it is :( – Phillip Weber Sep 15 '17 at 07:05
  • Can you at least add a (generated or trigger maintained) column for the group? 14s for 30k rows seems completely out of order by a factor of about 100, just because of a missing index. You can improve your query with a structure like `select * from patients p where (alert-condition) and exists (select * from patients p1 where p1.patient_id like and not (alert-condition for p1))`. It's not entirely clear from "and their family members do not." if *all* or *some* members should (not) have (some of these) alerts, so move the "not" around if it doesn't fit yet. – Solarflare Sep 15 '17 at 08:25
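
Fleshed out, the EXISTS structure Solarflare sketches above might look something like this (the exact alert condition, and which side the NOT belongs on, are assumptions; the comment deliberately leaves that open):

SELECT p.patient_id, p.name_first, p.name_last, p.alerts
FROM patients p
WHERE p.alerts REGEXP '[!@#%^&]'                             -- this patient has one of the symbols
AND EXISTS (SELECT 1
    FROM patients p1
    WHERE LEFT(p1.patient_id, 6) = LEFT(p.patient_id, 6)     -- same family/group
    AND p1.patient_id <> p.patient_id
    AND p1.alerts NOT REGEXP '[!@#%^&]');                    -- ...and some family member does not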

3 Answers


I found an efficient enough solution as follows:

SELECT p.patient_id, name_first, name_last, alerts
FROM patients p
JOIN (SELECT DISTINCT LEFT(patient_id, 6) AS group_id
    FROM patients
    WHERE patient_id BETWEEN 1000000 AND 7999999
    AND alerts REGEXP '[!@#%^&]') g
ON LEFT(p.patient_id, 6) = g.group_id /* HERE is the simple magic */
ORDER BY p.patient_id
LIMIT 30000;

Recordset: 80,000 ~ Results: 2188 ~ Duration: 0.312 sec ~ Fetch: 0.062 sec ~ Total: 0.374 sec

Since we know that legitimate patient_ids are 7 digits long, we can determine a patient's 'group_id' by simply using LEFT(patient_id, 6) instead of the less efficient SUBSTRING(patient_id, 1, CHAR_LENGTH(patient_id) - 1) (which I now see I could have originally written as SUBSTRING(patient_id, 1, 6)). Regardless of the method used here, the real savings come from the change to the ON clause. Instead of comparing patient_id to LIKE CONCAT(g.group_id, '%'), why not just make a direct = comparison against the left 6 digits of patient_id in table p?

In other words, the nested select finds all unique 'groups' in which at least one member has one of the desired alert symbols. The main select then uses this derived table to find all patients belonging to those groups. Essentially, LEFT(patient_id, 6) == 'group_id' and we get to keep our index... the only extra overhead is one LEFT() call per row.

Yet another example of "KISS."

Thank you to everyone for your help!

EDIT: Since I will be using the group_id in my PHP algo, I will add it into the select in an efficient manner:

SELECT g.group_id, RIGHT(p.patient_id, 1) AS sub_id, name_first, name_last, alerts
FROM patients p
JOIN (SELECT DISTINCT LEFT(patient_id, 6) AS group_id
    FROM patients
    WHERE patient_id BETWEEN 1000000 AND 7999999
    AND alerts REGEXP '[!@#%^&]') g
ON LEFT(p.patient_id, 6) = g.group_id
ORDER BY p.patient_id
LIMIT 30000;

HERE is a fiddle! ~ NOTICE: this is not an accurate representation of the solution, as the included table only has 28 records. See the above results on the larger dataset.


AND finally, the PHP algo I used to finish processing ~ shoutout to @The Nail:

$cur_group_id = 0;
$members = [];
$symbol = '';
$errs = false;

/* print every member of a flagged group as a table row */
$flush_group = function(array $members){
    foreach($members as $data){
        printf('<tr><td>%d%d</td><td>%s</td><td>%s</td><td>%s</td></tr>',
            $data['group_id'],
            $data['sub_id'],
            $data['name_last'],
            $data['name_first'],
            $data['alerts']);
    }
};

while($row = $result->fetch_assoc()){
    /* strip everything except the alert symbols we care about */
    $row['alerts'] = preg_replace('/[^!@#%^&]+/i', '', $row['alerts']);
    if($row['group_id'] != $cur_group_id){
        /* new group: print the previous group if it had a mismatch, then reset */
        if($errs){
            $flush_group($members);
        }
        $cur_group_id = $row['group_id'];
        $members = array();
        $symbol = $row['alerts'];
        $errs = false;
    }
    $members[] = $row;
    /* flag the group if this member's symbols differ from the group's first member */
    if($row['alerts'] != $symbol || strlen($row['alerts']) > 1){
        $errs = true;
    }
}
/* don't forget to print the final group */
if($errs){
    $flush_group($members);
}

Total processing time (including query): 0.6 sec!!

Phillip Weber
  • Funny, I wanted to suggest changing the group id comparison first, to prevent processing on the collected group-ids (see the edits of my SQL post). But then I changed the answer because of '... and their family members do not'. – The Nail Sep 18 '17 at 07:24

If you plan to do processing in PHP, and the number of patients is 30k, I would select all records ordered by patient id, iterate over all and process per group:

SELECT * FROM dataminer.patients ORDER BY patient_id;

In PHP, something like this:

$patientsWithRelevantAlert = array();
$currentGroupId = null;

while(... fetch into $row ...) {
    $groupId = extractGroupId($row);

    // Next group? Check relevant patient and reset group info.
    if ($groupId != $currentGroupId) {

        if (count($patientsWithRelevantAlert) == 1) {
            // remember this patient
            ...
        }
        $patientsWithRelevantAlert = array();
        $currentGroupId = $groupId;             
    }

    if(hasRelevantAlert($row)) {
        $patientsWithRelevantAlert[] = $row;
    }
}

// Don't forget the last group
if (count($patientsWithRelevantAlert) == 1) {
    // remember this patient
    ...
}

That should be fast enough.

That said, SQL, data modeling and indexes were invented for these things.

If this is homework: Make sure you understand the code when you hand it in!

The Nail
  • Turns out there are 80,000 patients... a little worried about the memory usage when multiple clients are using this query. If there was some way to cache the results on the webserver without creating a new table... or like you said, adding an indexed column group_id - talking to the db creator now to make sure he is OK with this. – Phillip Weber Sep 15 '17 at 16:58

If you are only interested in groups with one member, why don't you just select only the groups that have a count(patient_id) = 1?

SELECT g.group_id, MAX(g.patient_id)
FROM
(   SELECT
        SUBSTRING(patient_id, 1, CHAR_LENGTH(patient_id) - 1) AS group_id,
        patient_id
    FROM dataminer.patients
    WHERE ...
) g
GROUP BY g.group_id
HAVING COUNT(g.patient_id) = 1;
The Nail
  • I am interested in all patients in a group/family where at least one member has an alert containing any of the following symbols: !@#%^& – Phillip Weber Sep 15 '17 at 16:11