23

Situation

My goal is to have a yearly cronjob that deletes certain data from a database based on age. To my disposal I have the powers of Bash and MySQL. I started with writing a bash script but then it struck me that maybe, I could do everything with just a single SQL query.

I'm more a programmer by nature and I haven't had much experience with data structures so that's why I would like some help.

Tables / data structure

The relevant tables and columns for this query are as follows:

Registration:

+-----+-------------------+
| Id  | Registration_date |
+-----+-------------------+
|   2 | 2011-10-03        | 
|   3 | 2011-10-06        | 
|   4 | 2011-10-07        | 
|   5 | 2011-10-07        | 
|   6 | 2011-10-10        | 
|   7 | 2011-10-13        | 
|   8 | 2011-10-14        | 
|   9 | 2011-10-14        | 
|  10 | 2011-10-17        |
+-------------------------+ 

AssociatedClient:

+-----------+-----------------+
| Client_id | Registration_id |
+-----------+-----------------+
|         2 |               2 | 
|         3 |               2 | 
|         3 |               4 | 
|         4 |               5 | 
|         3 |               6 | 
|         5 |               6 | 
|         3 |               8 | 
|         8 |               9 | 
|         7 |              10 | 
+-----------------------------+

Client: only Id is relevant here.

As you can see, this is a simple many-to-many relationship. A client can have multiple registrations to his name, and a registration can have multiple clients.

The goal

I need to delete all registrations and client data for clients who have not had a new registration in 5 years. Sounds simple, right?

The tricky part

The data should be kept if any other client on any registration from a specific client has a new registration within 5 years.

So imagine client A having 4 registrations with just him in them, and 1 registration with himself and client B. All 5 registrations are older than 5 years. If client B did not have a new registration in 5 years, everything should be deleted: client A registrations and record. If B did have a new registration within 5 years, all client A data should be kept, including his own old registrations.

What I've tried

Building my query, I got about this far:

DELETE * FROM `Registration` AS Reg
WHERE TIMESTAMPDIFF(YEAR, Reg.`Registration_date`, NOW()) >= 5
AND 
    (COUNT(`Id`) FROM `Registration` AS Reg2
     WHERE Reg2.`Id` IN (SELECT `Registration_id` FROM `AssociatedClient` AS Clients
                         WHERE Clients.`Client_id` IN (SELECT `Client_id` FROM `AssociatedClient` AS Clients2
                                                       WHERE Clients2.`Registration_id` IN -- stuck
               #I need all the registrations from the clients associated with the first
               # (outer) registration here, that are newer than 5 years.

    ) = 0 -- No newer registrations from any associated clients

Please understand that I have very limited experience with SQL. I realise that even what I got so far can be heavily optimised (with joins etc) and may not even be correct.

The reason I got stuck is that the solution I had in mind would work if I could use some kind of loop, and I only just realised that this is not something you easily do in an SQL query of this kind.

Any help

Is much appreciated.

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
MarioDS
  • 12,895
  • 15
  • 65
  • 121
  • 5
    "a registration can have multiple clients" : your sample data in table `AssociatedClient` does not show this fact. Am I right if I assume the following two records may coexist in `AssociatedClient` : `(2,2), (3,2)` ? – RandomSeed Feb 26 '13 at 10:18
  • @Yak yeah this was a coincidence in what I've copied, I'll change it to include your example which is indeed valid. – MarioDS Feb 26 '13 at 10:19
  • 4
    If ClientA shares a registration_ID with ClientB, and ClientB shares a registration with ClientC, and only client C has a registration in the last 5 years should client A be retained due to it's link with client B? If so how deep can this recursive relationship go? – GarethD Feb 26 '13 at 10:24
  • @GarethD For simplicity's sake I would like to see a solution that works for 1 level only (i.e. just client A and B). In practice the situation you describe is probably very unlikely to occur. Please ignore it for now, I'll deal with that later, but if you could provide a solution that works for all levels it would be a great bonus. – MarioDS Feb 26 '13 at 10:32
  • Sorry for the errors in my former answer- I've deleted it. According to this question: http://stackoverflow.com/questions/45494/mysql-error-1093-cant-specify-target-table-for-update-in-from-clause the problem is that MySQL won't let you use a `SELECT` in the `WHERE` clause of a delete. I suppose you could use a 2 -stage solution - use a `SELECT` on the pattern of my query to put the ids to delete into a temp table, then do `DELETE * FROM REGISTRATION WHERE REGISTRATION.ID IN TEMPTABLE` – Rich Tolley Feb 26 '13 at 17:31
  • @RichTolley that might do the trick, but I've decided to use bash instead. It may not be ideal to process mysql results, but it will be less troublesome than trying to figure out the query. – MarioDS Feb 27 '13 at 08:12

6 Answers6

19

Begin by identifying the registrations of the other clients of a registration. Here's a view:

create view groups as 
select   a.Client_id
       , c.Registration_id
from AssociatedClient as a 
join AssociatedClient as b on a.Registration_id = b.Registration_id 
join AssociatedClient as c on b.Client_id = c.Client_id;

That gives us:

select Client_id
    , min(Registration_id) as first
    , max(Registration_id) as last
    , count(distinct Registration_id) as regs
    , count(*) as pals
from  groups 
group by Client_id;
Client_id   first       last        regs        pals      
----------  ----------  ----------  ----------  ----------
2           2           8           4           5         
3           2           8           4           18        
4           5           5           1           1         
5           2           8           4           5         
7           10          10          1           1         
8           9           9           1           1         

You dont' need a view, of course; it's just for convenience. You could just use a virtual table. But inspect it carefully to convince yourself it produces the right range of "pal registrations" for each client. Note that the view does not reference Registration. That's significant because it produces the same results even after we use it to delete from Registration, so we can use it for the second delete statement.

Now we have a list of clients and their "pal registrations". What's the date of each pal's last registration?

select g.Client_id, max(Registration_date) as last_reg
from groups as g join Registration as r
on g.Registration_id = r.Id
group by g.Client_id;
g.Client_id  last_reg  
-----------  ----------
2            2011-10-14
3            2011-10-14
4            2011-10-07
5            2011-10-14
7            2011-10-17
8            2011-10-14

Which ones have a latest date before a time certain?

select g.Client_id, max(Registration_date) as last_reg
from groups as g join Registration as r
on g.Registration_id = r.Id
group by g.Client_id
having max(Registration_date) < '2011-10-08';
g.Client_id  last_reg  
-----------  ----------
4            2011-10-07

IIUC that would mean that client #4 should be deleted, and anything he registered for should be deleted. Registrations would be

select * from Registration
where Id in (
      select Registration_id from groups as g
      where Client_id in ( 
            select g.Client_id
            from groups as g join Registration as r
            on g.Registration_id = r.Id
            group by g.Client_id
            having max(Registration_date) < '2011-10-08'
      )
);
Id          Registration_date
----------  -----------------
5           2011-10-07       

And, sure enough, client #4 is in Registration #5, and is the only client subject to deletion by this test.

From there you can work out the delete statements. I think the rule is "delete the client and anything he registered for". If so, I'd probably write the Registration IDs to a temporary table, and write the deletes for both Registration and AssociatedClient by joining to it.

James K. Lowden
  • 7,574
  • 1
  • 16
  • 31
1

You want to know all registrations that need to be kept. So your first query returns registrations within 5 previous years :

SELECT
  Id
FROM
  Registration
WHERE
  Registration_date >= '2011-10-08'

then all registrations with clients related to the previous query :

SELECT
  a2.Registration_id as Id
FROM
  AssociatedClient AS a1
  INNER JOIN AssociatedClient AS a2
    ON a1.Client_id = a2.Client_id 
WHERE
  a1.Registration_id IN
  (  
    SELECT
      Id
    FROM
      Registration
    WHERE
      Registration_date >= '2011-10-08'
 )

Then you have all registrations that you must not delete by combining the previous queries in an UNION, and you want all clients that are not part of this query :

SELECT
  Client_id
FROM
  AssociatedClient
WHERE
  Registration_id NOT IN
  (
    SELECT
      Id
    FROM
      Registration
    WHERE
      Registration_date >= '2011-10-08'
    UNION
    SELECT
      a2.Registration_id as Id
    FROM
      AssociatedClient AS a1
      INNER JOIN AssociatedClient AS a2
        ON a1.Client_id = a2.Client_id 
    WHERE
      a1.Registration_id IN
      (  
        SELECT
          Id
        FROM
          Registration
        WHERE
          Registration_date >= '2011-10-08'
      )
  )

you can see the results in this SQL fiddle

Then you can delete the lines of clients without registration correspondig to the criterias using the following query :

DELETE FROM
  AssociatedClient
WHERE
  Client_id IN (<previous query>);

and all registrations not present in AssociatedClient :

DELETE FROM
  Registration
WHERE
  Id NOT IN (SELECT Registration_id FROM AssociatedClient)
psadac
  • 2,312
  • 5
  • 32
  • 41
  • +1 for doing it the other way around (selecting those I want to keep) and for using `UNION`, the kind of operator I tend to forget about too often. – MarioDS Sep 26 '14 at 08:41
0

Use temporary tables.

INSERT INTO LockedClient(client_id) --select clients that should not be deleted
SELECT DISTINCT ac.client_id 
FROM AssociatedClient ac
JOIN Registration r ON r.Id = ac.ID
WHERE TIMESTAMPDIFF(YEAR, Reg.`Registration_date`, NOW()) >= 5;

DELETE  * FROM Registration r -- now delete all except locked clients
JOIN AssociatedClient ac ON ac.registration_id = r.id
LEFT JOIN LockedClient lc ON lc.client_id = ac.client_id
WHERE TIMESTAMPDIFF(YEAR, Reg.`Registration_date`, NOW()) >= 5 AND lc.client_id IS NULL
ijrandom
  • 745
  • 6
  • 9
  • Sorry, this isn't really a solution. The whole problem is finding the client Id's that should not be deleted. You're just saying what to do with them once I got them. – MarioDS Feb 26 '13 at 13:29
0

This should give you the proper clients information 1 level down into the linked clients. I know that this may not give you all the needed information. But, as stated in the comments, a 1 level implementation should be sufficient for now. This may not be optimal.

SELECT
AC1.Client_id,
MAX(R.Registration_date) AS [LatestRegistration]
FROM
#AssociatedClient AC1
JOIN #AssociatedClient AC2
    ON  AC1.Registration_id = AC2.Registration_id
JOIN #AssociatedClient AC3
    ON  AC2.Client_id = AC3.Client_id
JOIN #Registration R
    ON  AC3.Registration_id = R.Id
GROUP BY
AC1.Client_id

You should look into a function using loops. That's the only thing I can think about right now.

0

I'm a SQL Server guy, but I think this syntax will work for MySQL. This query will pull the clients that should not be deleted.

SELECT A3.Client_id
FROM AssociatedClient A1
#Get clients with registrations in the last 5 years
JOIN Registration R1 ON A1.Registration_id = R1.Id 
    AND TIMESTAMPDIFFERENCE(YEAR, R1.Registration_Date, Now()) <= 5
#get the rest of the registrations for those clients
JOIN AssociatedClient A2 ON A1.Client_id = A2.Client_id
#get other clients tied to the rest of the registrations
JOIN AssociatedClient A3 ON A2.Registration_id = A3.Registration_id
Jim
  • 3,482
  • 22
  • 18
0

You need two sql delete statement, because you are deleting from two tables.

Both delete statements need to distinguish between registrations which are being kept and those being deleted, so the delete from the registration table needs to happen second.

The controlling issue is the most recent registration associated with an id (a registration id or a client id). So you will be aggregating based on id and finding the maximum registration date.

When deleting client ids, you delete those where the aggregate registration id is older than five years. This deletion will disassociate registration ids which were previously linked, but that is ok, because this action will not give them a more recent associated registration date.

That said, once you have the client ids, you'll need a join on registration ids which finds associated registration ids. You'll need to join to client ids and then self join back to registration ids to get that part to work right. If you've deleted all client ids which were associated with a registration you'll need to delete those registrations also.

My sql is a bit rusty, and my mysql rustier, and this is untested code, but this should be reasonably close to what I think you need to do:

delete from associatedclient where client_id in (
  select client_id from (
    select ac.client_id, max(r.registration_date) as dt
      from associatedclient ac
        inner join registration r
          on ac.registration_id = r.id
      group by ac.client_id
  ) d where d.dt < cutoff
)

The next step would look something like this:

delete from registration where id in (
  select id from (
    select r1.id, max(r2.date) dt
      from registration r1
        inner join associated_client ac1
          on r1.id = ac1.registration_id
        inner join associated_client ac2
          on ac1.client_id = ac2.client_id
        inner join registration r2
          on ac2.registration_id = r2.id
) d
  where d.dt < cutoff
  or d.dt is null

I hope you don't mind me reminding you, but you should want to run the select statements without the deletes, first, and inspect the result for plausibility, before you go ahead and delete stuff.

(And if you have any constraints or indices which prevent this from working you'll have to deal with those also.)

rdm
  • 658
  • 5
  • 16