SQL request to both group responses and retrieve only the last line of each

Question

I have a table without IDs. it has 3 columns: the name of a computer, its status (on/off) at the moment of the poll, and the timestamp of the insertion.

if I run

select * from computers group by name;

I get a line for each computer (there are 200 different ones), but these lines don't always hold the latest entry for it. I then tried

select computers group by name order by timestamp asc;

But I get incoherent responses (some recent timestamps, some old ones... no idea why).

It's basically the same problem as here : SQL: GROUP BY records and then get last record from each group?, but I don't have ids to help :(

You have multiple row for each computer and you want the last row added by using timestamp colomn ? — David Bélanger, Mar 23 '12 at 16:12
@DavidBélanger yes, for each computer i want the last row added — Cystack, Mar 23 '12 at 16:24

ruakh · Accepted Answer · 2012-03-24T13:25:25.527

1

You can write:

SELECT computers.name,
       computers.status,
       computers.timestamp
  FROM ( SELECT name,
                MAX(timestamp) AS max_timestamp
           FROM computers
          GROUP
             BY name
       ) AS t
  JOIN computers
    ON computers.name = t.name
   AND computers.timestamp = t.max_timestamp
;

The above uses this subquery to finds the greatest timestamp for each name:

SELECT name
       MAX(timestamp) AS max_timestamp
  FROM computers
 GROUP
    BY name
;

and then it gathers fields from computers whose name and timestamp match something that the subquery returned.

The reason that your order by clause has no effect is that it comes too "late": it's used to order records that are going to be returned, after it's already determined that they will be returned. To quote from §11.16.3 "GROUP BY and HAVING with Hidden Columns" in the MySQL 5.6 Reference Manual on this subject:

The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate. Furthermore, the selection of values from each group cannot be influenced by adding an ORDER BY clause. Sorting of the result set occurs after values have been chosen, and ORDER BY does not affect which values the server chooses.

Another way is to write a correlated subquery, and dispense with the GROUP BY entirely. This:

SELECT name, status, timestamp
  FROM computers AS c1
 WHERE NOT EXISTS
        ( SELECT 1
            FROM computers
           WHERE name = c1.name
             AND timestamp > c1.timestamp
        )
;

finds all rows in computers that haven't been superseded by more-recent rows with the same name. The same approach can be done with a join:

SELECT c1.name, c1.status, c1.timestamp
  FROM computers AS c1
  LEFT
 OUTER
  JOIN computers AS c2
    ON c2.name = c1.name
   AND c2.timestamp > c1.timestamp
 WHERE c2.name IS NULL
;

which is less clear IMHO, but may perform better.

edited Mar 24 '12 at 13:25

answered Mar 23 '12 at 16:21

ruakh

175,680
26
273
307

it works EXACTLY like I wanted it, BUT the request took 19 seconds, and my database is only 2 days old. It will quickly become unsustainable, what can I do ? – Cystack Mar 23 '12 at 16:27
@Cystack: I've updated my answer to give some variants that dispense with the `GROUP BY` and may perform better. (The version with the `LEFT OUTER JOIN` is likely to perform the best, though it's also the least clear IMHO.) – ruakh Mar 23 '12 at 16:33
thank you very much. do you think I would be better off (faster) re writing and porting the table, and using an id for each entry ? – Cystack Mar 23 '12 at 17:25
or maybe having an additionnal table that only retains the last status of each computer ? – Cystack Mar 23 '12 at 17:26
Without knowing more about your data and your needs, it's hard for me to give you categorical advice; but yes, it's likely that you'll want two tables, one with current information (say, `id`, `name`, `current_status`, and `last_modified`) and one with historical information. Google "slowly changing dimension" for information on the different types of approaches that people use to handle these sorts of situations in general. – ruakh Mar 23 '12 at 17:58
Im reading it right now. Your last proposition is giving me a syntax error, and Im not good enough at SQL to detect where it comes from :( – Cystack Mar 24 '12 at 09:31
What is the exact SQL you're running, and what is the exact error you're getting? – ruakh Mar 24 '12 at 12:02
MySQL v5.5+, I get a "there is an error near "LEFT OUTER JOIN" etc. until the end of the request – Cystack Mar 24 '12 at 13:03
I just ran `create table computers (name varchar(20), status char(1), timestamp timestamp);` followed by my *exact* query above, and it did not give any syntax error. (It *did* give me the error that "Column 'name' in field list is ambiguous", which I had to address by prefixing each of the field-list columns with `c1.` -- I'll update my answer -- but nothing like what you describe.) Unless you're willing to post the exact query you're using, I think you're on your own for this one, sorry. :-/ – ruakh Mar 24 '12 at 13:24
@Cystack: As for speed of query, you should look at the indexes on the table and the query plan that is in use for the first query. The chances are you don't have any indexes and the system is having to do sequential scans over the table. This MySQL feature of allowing GROUP BY without listing all non-aggregate columns seems to cause more confusion than benefit. At least, it triggers quite a lot of questions on SO, but then again, those who get it to work as they want don't need to ask questions about why. – Jonathan Leffler Mar 24 '12 at 14:42
spot on @JonathanLeffler ;) what should I index? everything? – Cystack Mar 24 '12 at 15:02
2

@Cystack: A single index on `(name, timestamp)` should help considerably. – ruakh Mar 24 '12 at 15:02
would you mind explaining a bit on why this was the right thing to index ? – Cystack Mar 24 '12 at 17:07
The index on `(name, timestamp)`, which is what I would have suggested had I been paying attention, effectively does the grouping for you by listing all the rows for a given name together in the index, in time order. The optimizer doesn't need to read the data while processing the inner query, and it can rather easily find the maximum timestamp for each name. A 300-times performance improvement is extreme, but proper indexing can often help vastly. Over-indexing slows up both update operations (INSERT, DELETE and UPDATE), and the optimization process (because the optimizer has more options). – Jonathan Leffler Mar 24 '12 at 17:39

SQL request to both group responses and retrieve only the last line of each

1 Answers1