Weeding out duplicates based on comparisons of multiple fields

Question

I'm working on a system that receives and parses emails that are formatted in a structured way and saves the information to a MySQL database. Each email describes a specific event. However, sometimes a single event can generate multiple emails because a new email is sent when the event details are updated by a dispatcher. This can happen multiple times over the course of a few minutes as new information is added.

I'm trying to write a query that will select only the most recent row that describes a given event.

Below is an example of what the table might look like. Note that EventID's 3 and 4 refer to the same event. Event 4 is simply a more recently received email with more details. A human can quickly look at the address, event type, and received times, and infer that they refer to the same event. The question is: how can I translate this into a query with some reasonable accuracy?

EventID    EventReceived        EventAddress        EventType      EventDetails
4          11-15-2012 22:55     1234 Anywhere Ln    Fall           In backyard. Possible fracture.
3          11-15-2012 22:54     1234 Anywhere Ln    Fall           In backyard.
2          11-15-2012 17:04     4321 Freedom Ct     Heart Attack   Short of breath. Took 2 NTG.
1          11-15-2012 13:43     9871 Main St        Traffic Crash  2 vehicles in intersection. Fluid leaking from vehicle.

What I would like to receive when I query the database is this:

EventID    EventReceived        EventAddress        EventType      EventDetails
4          11-15-2012 22:55     1234 Anywhere Ln    Fall           In backyard. Possible fracture.
2          11-15-2012 17:04     4321 Freedom Ct     Heart Attack   Short of breath. Took 2 NTG.
1          11-15-2012 13:43     9871 Main St        Traffic Crash  2 vehicles in intersection. Fluid leaking from vehicle.

At first, I tried this:

SELECT * FROM table_name WHERE (EventID) IN (SELECT MAX(EventID) FROM table_name GROUP BY EventAddress) ORDER BY EventReceived DESC;

Generally, it seems to work, although there are at least two issues:

the query is very slow with even a few hundred records (~15 seconds)
I suspect I'm missing events where the EventAddress is the same, but the events are, say, several days apart. These are obviously NOT the same event, but would probably get rolled up because I'm only grabbing the highest EventID within the grouped addresses.

So:

how can I improve the efficiency of the query, and
how can I account for the importance of the time factor? (I can impose some arbitrary time limit on what should be considered the same event; 15 minutes, for example).

Would it be advisable to put this logic in my application rather than relying on a SQL query?

Thank you to anyone who's made it this far and has any ideas that might be helpful!

Note: I took a look at this question, thinking it might be helpful, but I'm not sure if I can make it work since the times will not be exactly the same.

score 0 · Answer 1 · answered Nov 16 '12 at 04:45

One possible optimization solution would be to run a frequent process to delete the duplicates.

Eg: Run a script every 10 minutes to go through all the records in the past 10-11 min, identify the duplicates, and delete them.

The frequency of the script execution, and the time period it works on, could be best determined by your system's requirements.

score 0 · Answer 2 · answered Nov 16 '12 at 13:40

You can try something like this:

SELECT t1.* 
FROM table_name t1
LEFT JOIN table_name t2 ON t2.EventAddress = t1.EventAddress AND t2.EventReceived > t1.EventReceived 
WHERE t2.EventID IS NULL

I have no way to test the syntax at the moment, but the idea is to self join the table against the same events that are more recent, and only keep the events that do not have a more recent entry.

Weeding out duplicates based on comparisons of multiple fields

2 Answers2