MySQL filter out self-references

Question

We have a table with events (as in calendar event with start and end times) that is regularily queried:

TABLE event (
  `id` varchar(32) NOT NULL,
  `start` datetime,
  `end` datetime,
  `derivedfrom_id` varchar(32),
  `parent_id` varchar(32) NOT NULL
)

The parent_id points to a calendar table that provides some additional information.
Some of the events were created out of another event and hence have a reference pointing to that "origin" event via the derivedfrom_id column.

When retrieving a set of events, we usually query by date (start/end) and calendar (parent_id) and limit the number of results via limit for paging.

The problem we are now facing: sometimes we need to merge related events for the user into a single representation. So we do our normal query

SELECT id, start, parent_id
FROM event
WHERE parent_id in (<list of calendars>)
  AND start >= 'some date'
LIMIT x

... and then filter out the original events, because the derivates have different information and refer to their origins anyways.

As you might have seen (sooner than we did), we do the limit before the filtering and thus receive a set of events with smaller cardinality than what we initially anticipated, i.e. the number of results is lower than 'x' after the filtering.

The only thing I could think of is to duplicate the query and do a sub-select:

SELECT id, start, parent_id
FROM event
WHERE parent_id in (<list_of_calendars>)
  AND start >= 'some date'
  AND (/* the part below duplicates the previous conditions */
        derivedfrom_id is not null
        or id not in (
          SELECT derivedfrom_id
          FROM event
          WHERE parent_id in (<list_of_calendars>)
            AND start >= 'some date'
            AND derivedfrom_id is not null
        )
      )
LIMIT x

But I hardly believe that this is the only way to do this. Especially, since our query is much more complicated.

Is there a better way?

Example Data

(as requested in a comment)

Given these three events:

│ *ID* │ *DERIVEDFROM_ID* │ *PARENT_ID* │ *START*
├──────┼──────────────────┼─────────────┼─────────────────
│ 100  │ -                │ A           │ 2014-11-18 15:00
│ 101  │ 100              │ B           │ 2014-11-18 15:00
│ 150  │ -                │ A           │ 2014-11-20 08:00

... and a limit of 2, I want to get events 101 and 150.

Instead, with the current approach:

The query with a limit of 2 results in events 100 and 101
After filtering, event 100 is discarded and the only remaining event is 101

Note on Expected Answer

The SQL above is actually generated from a Java application that uses JPA. My current solution is to generate a where clause and duplicate it. If there is something generic JPA-specific, I would appreciate any pointers.

Some sample data and desired results would help clarify the relationships — Gordon Linoff, Nov 18 '14 at 15:43
If the filtering logic is not in the same SQL there is no way you can achieve the desired number of results. What is the filtering logic? — Pradeep, Nov 18 '14 at 16:50
derived_id is null is the logic? If it is a different subselect can you post it? — Pradeep, Nov 18 '14 at 19:49
So apparently you need some rows to be both returned and used as a reference to exclude other rows. Doesn't seem possible to me without repetition of at least some part of the logic. I can see one way to *slightly* simplify your current version, but depending on the complexity of the real query it might still not be good enough. — Andriy M, Nov 24 '14 at 19:02
@ForguesR, unfortunately no. The idea is to remove duplicates using the _derivedfrom_id_ property. — Kariem, Nov 28 '14 at 15:56
@fthiella, good question! No, in the current application design we only have one level of 'derivation'. — Kariem, Nov 28 '14 at 15:57
Can an event be related to another related event (like a chain of related events) or it is limited to 1? — ForguesR, Nov 28 '14 at 16:45
@ForguesR already answered that in response to fthiella's question: The 'chain of related events' is at most one level deep. — Kariem, Nov 28 '14 at 16:48
about your example: you are searching for calendar A and B, and you are filtering out row 100 because row 101 is already present? And what if you want to search for just A? You want to return 100 and 150? — fthiella, Nov 28 '14 at 18:43
@fthiella, yes on both questions. Regarding the latter: if we only query on calendar A, we primarily want information regarding events on calendar A, so we only return the two events 100 and 150 as you have indicated. If more is necessary, another query is acceptable. — Kariem, Nov 28 '14 at 20:22
Have you tried your approach anyway? I think there's a discrepancy between the verbal description and the implemented logic in your query, because your query would actually filter out *derivatives* and keep their originals. It would probably keep derivatives of events that didn't match other criteria (e.g. didn't match the date range). Perhaps you actually wanted `id not in (SELECT derivedfrom_id ...)` instead of `derivedfrom_id not in (SELECT id ...)`, although you would then need to filter out NULLs in the subquery. — Andriy M, Nov 29 '14 at 01:54
I agree with @ForguesR, if you are interested in **any** x records of a query, you can omit `order by`, but if you prefer some records over others, `order by` is mandatory. — Saic Siquot, Dec 01 '14 at 12:40
@AndriyM Yes! You are right. I did not notice this while copying. I have just updated it. Thank you! — Kariem, Dec 02 '14 at 23:33
Maybe I did not understand your question, but the answer you accepted seems convoluted to me. If there is another event added to the above list, having id = 102 and derivedfrom_id = 100, what should be the output of the query? (101, 150) or (101, 102, 150)? Or maybe (102, 150)? — axiac, Dec 03 '14 at 15:29

score 4 · Accepted Answer · answered Dec 02 '14 at 13:57

Try this:

SELECT e.*
FROM `event` e            # 'e' from 'event'
  LEFT JOIN `event` d     # 'd' from 'derived'; `LEFT JOIN` gets ALL entries from `e`
    ON e.id = d.derivedfrom_id    # match an event `e` with all those `d` derived from it
WHERE d.id IS NULL        # keep only events `e` without derived events `d`
;

The LEFT JOIN selects all events from e and pairs them with the events d that derive from them. It ensures all the entries from e have the chance to be selected, no matter if they have derived events or not. The WHERE clause keeps only the events from e that do not have derived events. It keeps the derived events and also the original events that do not have derived events but strips out those original events that have derived events.

Add additional WHERE conditions on fields of table e as you wish, use a LIMIT clause, stir well, serve cold.

I did not consider this elegant solution before. I have just started another bounty and will assign it to this answer. Thank you! — Kariem, Dec 03 '14 at 16:51
It's not about bounty or points. If it helps you, you're welcome! — axiac, Dec 03 '14 at 16:54
It's sure not about points, but the best answer should get the bounty. — Kariem, Dec 05 '14 at 16:48

score 3 · Answer 2 · edited May 23 '17 at 12:31

I suggest to group the events by their DERIVEDFROM_ID or - if it is not a derived event their ID using MySQL's IFNULL method, see SELECT one column if the other is null

SELECT id, start, parent_id, text, IFNULL(derivedfrom_id, id) as grouper
FROM event
WHERE parent_id in (<list_of_calendars>)
    AND start >= '<some date>'
GROUP BY grouper
LIMIT <x>

This would however return randomly the original or a derived event. If you want to get only derived events you'd have to sort your results by ID before grouping (assuming the IDs are ascending and derived events have thus higher IDs than their ancestor). Because it's not possible to run a ORDER BY before a GROUP BY in MySQL you'll have to ressort to an inner join (MySQL order by before group by):

SELECT e1.* FROM event e1
INNER JOIN
(
    SELECT max(id) maxId, IFNULL(derivedfrom_id, id) as grouper
    FROM event
    WHERE parent_id in (<list_of_calendars>)
        AND start >= '<some date>'
    GROUP BY grouper
) e2
on e1.id = e2.maxId
LIMIT <x>

edit: As pointed out by Aaron the assumption of ascending IDs is in conflict with the given data structure. Assuming there is a timestamp created you could use a query like this:

SELECT e1.* FROM event e1
INNER JOIN
(
    SELECT max(created) c, IFNULL(derivedfrom_id, id) grouper
    FROM event
    WHERE parent_id IN (<list_of_calendars>)
        AND start >= '<some date>'
    GROUP BY grouper
) e2
ON (e1.id = e2.grouper AND e1.created = c) OR (e1.derivedfrom_id = e2.grouper AND e1.created = c)
LIMIT <x>

SQL Fiddle

This solution is dependent on the derivedfrom_id being larger than the id. Given they are Varchar(32)'s I wouldn't necessarily assume that is the case. — Aaron, Dec 02 '14 at 19:53
Yes, you are correct. My assumption was based on the example data. If this assumption doesn't hold my proposed solution would have to rely on some `created` timestamp. I'll update my answer accordingly. — Rupert Angermeier, Dec 03 '14 at 11:00

Devarsh · Answer 3 · 2014-11-28T17:32:28.203

0

Are looking for something like this ::

Select a.id, a.start, a.parent_id from 
event a , event b
Where a.parent_id in (<list_of_calendars>)
And a.start >= 'some date'
And b.parent_id = a.parent_id
And b.start = a.start
And a.id != b.derivedfrom_id
Limit x

edited Nov 28 '14 at 17:32

answered Nov 28 '14 at 16:03

Devarsh

116
3

Thank you. This works, but except for refactoring the sub-select into the main query this is the same as my current working query, including the duplication of constraints, i.e. `parent_id in ...` and `start >= ...`. – Kariem Nov 28 '14 at 16:49

score 0 · Answer 4 · answered Dec 01 '14 at 23:21

to omit those events that have derived events in the result set, you can test each id whether to omit it or not, or join on a derived table of the ids to exclude

join:

SELECT id, start, parent_id 
  FROM event
  LEFT JOIN (
    SELECT DISTINCT derived_id AS id FROM event
     WHERE start >= 'some date' AND parent_id IN (<calendars>)
  ) omit
    ON omit.id = event.id
 WHERE parent_id IN (<calendars>)
   AND start >= 'some date'
   AND omit.id IS NULL
 LIMIT x

nested select: reasonably efficient if if derived_id is indexed

SELECT e.id, e.start, e.parent_id
  FROM event e
  WHERE parent_id IN (<calendars>)
    AND start >= 'some date'
    AND (SELECT e2.id FROM event e2      /* and does not have derived events */
          WHERE e2.derived_id = e.id
            AND e2.start >= 'some date'
          LIMIT 1) IS NULL
  LIMIT x

in mysql you can't test for the negative, you have to build up the list of excludes and omit the explicitly

Since the parent_id (calendar) can vary, all selects must test it. Checking the start does not have to be duplicated if we can assume that no derived event can occur before its original event.

Note that you refer to filtering out the original event (id 100, because it has derived event 101), but I think your example nested select is filtering out the derived event.

score 0 · Answer 5 · answered Dec 03 '14 at 03:42

Assuming that parent_id value in the 'derivative' row matches the parent_id value on the 'origin' row, and that the start value on the derivative row is guaranteed to not be earlier than the start on the parent row... (Those are assumptions, because I don't believe either of these was specified) ... then ...

One quick fix would be to add a "NOT EXISTS" predicate to the existing query. We'd just assign an alias to the table reference in the original query (e.g. e), and then add to the WHERE clause...

   AND NOT EXISTS (SELECT 1 FROM event d WHERE d.derivedfrom_id = e.id)

To explain that a little bit... for an 'origin' row, the subquery will find a matching 'derivative' row, and when that row is found, the 'origin' row will be excluded from the resultset.

Back to those assumptions... if we don't have a guarantee about the parent_id matching in the 'origin' and 'derivative' row... and/or we don't have a guarantee about the start value, then we'd need to repeat the appropriate predicates (on parent_id and start) in the correlated subquery to check whether a matched 'derivative' row would be returned or not, the addition of the predicates makes the query appear to more complex:

   AND NOT EXISTS ( SELECT 1
                      FROM event d
                     WHERE d.derivedfrom_id = e.id 
                       AND d.parent_id IN parent_id IN (<list of calendars>)
                       AND d.start > 'some date' 
                  )

Sometimes, we can get better performance by rewriting the query to replace the NOT EXISTS with an equivalent "anti-join" pattern.

To describe that, it's an "outer join" to find matching 'derivative' rows, and then filtering out the rows that had at least one matching 'derivative' row(s).

Personally, I think the NOT EXISTS form is more intuitive, the anti-join pattern is a little more obfuscated. The benefit of the anti-join is better performance (in some cases).

As an example of an anti-join pattern, I'd re-write the query something like this:

SELECT e.id
     , e.start
     , e.parent_id
  FROM event e
  LEFT
  JOIN event d
    ON d.derivedfrom_id = e.id
   AND d.parent_id IN (<list of calendars>)
   AND d.start >= 'some date'
 WHERE d.derivedfrom_id IS NULL
   AND e.parent_id IN (<list of calendars>)
   AND e.start >= 'some date'
 ORDER BY e.id
 LIMIT x

To unpack that a bit.. the LEFT [OUTER] JOIN operation finds matching 'derivative' rows, which would return rows from e that have matching 'derivative' rows, along with rows from e that don't have a match. The "trick" is the IS NULL condition on a column that is guaranteed to be non-NULL when a matching derivative row was found, that predicate excludes rows that found a match.

(I also added an ORDER BY clause just to make the result a little more deterministic.)

MySQL filter out self-references

Example Data

Note on Expected Answer

5 Answers5