0

I have a table like this:

id (PRIMARY KEY, AUTO_INCREMENT) || user_id || timestamp || data1 || data2 || data3 (...)

I need to extract a single data for every user_id ("the newest row") that is, with respect to latest timestamp in the table.

The answer given here works fine, but I am concerned with performance of this query: MySQL - using GROUP BY and DESC

I have also tested some answers here, but receive "Too few arguments" error and empty query: https://stackoverflow.com/a/7306288/2715309

Select only newest grouped entries

What am I doing wrong and what is the most optimal way, given the fact that I have unique auto_inc column as primary key?

Thanks

Community
  • 1
  • 1
dev101
  • 1,359
  • 2
  • 18
  • 32

1 Answers1

1

The selected answer in your first answer uses a MySQL extension that is explicitly documented not to always work (and I've commented on that answer). Here is a link to the documentation page: http://dev.mysql.com/doc/refman/5.7/en/group-by-extensions.html.

The second version can work, if you use a correlated subquery or join:

select *
from table t
where t.timestamp = (select max(t2.timestamp)
                     from table t2
                     where t2.user_id = t.user_id
                    );

This should have reasonable performance if you have an index on table(user_id, timestamp).

A similar version to this uses join with aggregation:

select t.*
from table t join
     (select t2.user_id, max(t2.timestamp) as maxts
      from table t2
      group by t2.user_id
     ) tmax
     on t2.user_id = t.user_id and t2.maxts = t.timestamp;

EDIT:

Try this variant with the same index:

select *
from table t
where not exists (select 1
                  from table t2
                  where t2.user_id = t.user_id and t2.timestamp > t.timestamp
                 );

This is the form I usually recommend.

Nimantha
  • 6,405
  • 6
  • 28
  • 69
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Hi Gordon, thanks for you answers, I have tried your both suggestions, only replacing 'table' with actual table name, and again received too few args error. I am not sure why, but I need to use select * since in later processing every column data is used. Also, statement is processed by internal app's functions, maybe it does not understand the statement completely and complains about missing arguments like user_id and timestamp, which are not defined prior, that is why I use/need * – dev101 Sep 07 '14 at 09:56
  • In any case, should I use index on timestamp and user_id - will it help speed up things in first case that works for me (with two SELECT * queries) ? This table will get large easily (like 10M rows). – dev101 Sep 07 '14 at 09:58
  • @dev101 . . . Yes. You want the index on `user_id, timestamp`, with the columns specified in that order. – Gordon Linoff Sep 07 '14 at 10:45
  • Thanks! I was reading about indexes and there is a possibility that with mixing DESC and ASC indexes won't always work, hopefully you are correct. – dev101 Sep 07 '14 at 11:14
  • Hi Gordon and everyone who might read this in the future, well, I have generated 100k rows of data, created indexes and rebuilt the table, and done some benchmarks (with double SELECT * query that works). With indexed table the result was 115 seconds for a full table scan on a very modest computer. The same test without indexes took exactly 115 seconds, which proved that those articles are unfortunately correct. Will keep investigating why those other queries do not work. – dev101 Sep 07 '14 at 22:10
  • @dev101 . . . This is strange. But "created indexes"? Only one index, with two columns, is needed for the query. – Gordon Linoff Sep 07 '14 at 22:39
  • Well, I used phpMyAdmin to create index on first column, then second, then import-export to rebuild. I do not know how to create index only once on 2 columns at the same time. Will search for answer about this. – dev101 Sep 08 '14 at 08:01
  • I have also tried this query: SELECT * FROM table GROUP BY user_id; since later using another query to do some other extractions, sorting by timestamp and LIMIT to 1 (in the php loop) and cache is probably taking into effect there, but the results (without index/indexes) are 10 times faster than double select one -- now I am confused. – dev101 Sep 08 '14 at 08:06
  • Hmm, I tried several ways, like CREATE INDEX `someidx` on `table` (`user_id`, `timestamp`,) ; after CREATE TABLE (); part in exported files, but on import get SQL syntax errors. Not sure what is wrong, stil trying to find out. – dev101 Sep 08 '14 at 09:04
  • update: well, seems like one comma after timestamp, was too much (and also noticed that SO formatting strips quotes from words), importing in progress... – dev101 Sep 08 '14 at 09:07
  • results: it improved down to 105 seconds, but that query is just too slow. Nevermind, I dropped that idea and now using the simple query posted few comments above, seems to do the trick. Thanks for your help! Regards – dev101 Sep 08 '14 at 09:20