I'm working on a system that receives and parses emails that are formatted in a structured way and saves the information to a MySQL database. Each email describes a specific event. However, sometimes a single event can generate multiple emails because a new email is sent when the event details are updated by a dispatcher. This can happen multiple times over the course of a few minutes as new information is added.
I'm trying to write a query that will select only the most recent row that describes a given event.
Below is an example of what the table might look like. Note that EventID
's 3
and 4
refer to the same event. Event 4
is simply a more recently received email with more details. A human can quickly look at the address, event type, and received times, and infer that they refer to the same event. The question is: how can I translate this into a query with some reasonable accuracy?
EventID EventReceived EventAddress EventType EventDetails
4 11-15-2012 22:55 1234 Anywhere Ln Fall In backyard. Possible fracture.
3 11-15-2012 22:54 1234 Anywhere Ln Fall In backyard.
2 11-15-2012 17:04 4321 Freedom Ct Heart Attack Short of breath. Took 2 NTG.
1 11-15-2012 13:43 9871 Main St Traffic Crash 2 vehicles in intersection. Fluid leaking from vehicle.
What I would like to receive when I query the database is this:
EventID EventReceived EventAddress EventType EventDetails
4 11-15-2012 22:55 1234 Anywhere Ln Fall In backyard. Possible fracture.
2 11-15-2012 17:04 4321 Freedom Ct Heart Attack Short of breath. Took 2 NTG.
1 11-15-2012 13:43 9871 Main St Traffic Crash 2 vehicles in intersection. Fluid leaking from vehicle.
At first, I tried this:
SELECT * FROM table_name WHERE (EventID) IN (SELECT MAX(EventID) FROM table_name GROUP BY EventAddress) ORDER BY EventReceived DESC;
Generally, it seems to work, although there are at least two issues:
- the query is very slow with even a few hundred records (~15 seconds)
- I suspect I'm missing events where the
EventAddress
is the same, but the events are, say, several days apart. These are obviously NOT the same event, but would probably get rolled up because I'm only grabbing the highestEventID
within the grouped addresses.
So:
- how can I improve the efficiency of the query, and
- how can I account for the importance of the time factor? (I can impose some arbitrary time limit on what should be considered the same event; 15 minutes, for example).
Would it be advisable to put this logic in my application rather than relying on a SQL query?
Thank you to anyone who's made it this far and has any ideas that might be helpful!
Note: I took a look at this question, thinking it might be helpful, but I'm not sure if I can make it work since the times will not be exactly the same.