2

I have a somewhat complicated question related to MySQL. This is the table I have:

keyword   args     title    namespace_id      ratio
en        1        A        23                0.5
en        1        B        89                0.6
en        0        C        89                0.4
foo       1        Foo      23                0.7
bar       1        Bar      89                0.3

I want a list of all rows without duplicates of (keywords,args). If there are duplicates, the picked row should be selected by an order of namespace_ids I provide. The rows left should be ordered by ratio.

Example result with namespace_id order 23,89,x,y:

keyword   args     title    namespace_id      ratio
foo       1        Foo      23                0.7
en        1        A        23                0.5
en        0        C        89                0.4
bar       1        Bar      89                0.3

Example result with namespace order 89,23,x,y:

keyword   args     title    namespace_id      ratio
foo       1        Foo      23                0.7
en        1        B        89                0.6
en        0        C        89                0.4
bar       1        Bar      89                0.3

Is there any way I can to this directly in MySQL? I had a look at GROUP BY, ORDER BY and noticed even the GROUP_CONCAT() function but I didn't manage to put it all together properly. The statement that does somehow I want is:

  SELECT keyword, args, title, namespace_id, ratio 
    FROM tbl 
GROUP BY keyword, args 
ORDER BY ratio DESC;

But now I don't know how to bring in the namespace_id order.

I've found similar questions like this: MySQL: "order by" inside of "group by" The answer there close to what I want, however, my namespace_id order is varying and cannot be calculated by the MAX() function.

EDIT: The challenge here is to tell GROUP BY which row to pick. The normal ORDER BY apparently doesn't, it only uses the output of GROUP BY.

Community
  • 1
  • 1
Georg Jähnig
  • 779
  • 1
  • 8
  • 19
  • Also see the FIELD function, per http://stackoverflow.com/a/9378709/161052 also related: http://stackoverflow.com/q/3601258/161052 – JYelton Feb 23 '12 at 17:19
  • 1
    How long is the list of items you want to sort by? In your example, you show 89, 23, x, y. Does this mean you expect to only have four values? Can it be more? Up to how many? You might want to also use CASE. – JYelton Feb 23 '12 at 17:31
  • In theory, there can be more `namespace_id` values to order on but I don't expect them to be more than 5 to 6. – Georg Jähnig Feb 23 '12 at 22:15
  • The answer to the question you linked to (http://stackoverflow.com/questions/5362160/mysql-order-by-inside-of-group-by) will not help you. If you use their technique, you will get an arbitrary value for the `title` column. – Hanno Fietz Feb 25 '12 at 16:14
  • Unrelated to your question, but: What cardinality of the keyword / arg pair do you expect in relation to the number of entries in that table? Would it make sense to normalize that out? – Hanno Fietz Feb 25 '12 at 16:46

5 Answers5

2

Try using FIELD() and a subquery:

SELECT t1.keyword, t1.args, t1.title, t1.namespace_id, t1.ratio
FROM tbl t1, (SELECT keyword, args, MIN(FIELD(namespace_id, 23, 89))
minfield FROM tbl GROUP BY keyword, args) t2
WHERE t1.keyword = t2.keyword AND t1.args = t2.args AND
FIELD(t1.namespace_id, 23, 89) = t2.minfield
ORDER BY ratio DESC;

or a self-outer-join:

SELECT t1.keyword, t1.args, t1.title, t1.namespace_id, t1.ratio
FROM tbl t1
LEFT OUTER JOIN tbl t2 ON
    t1.keyword = t2.keyword AND
    t1.args = t2.args AND
    FIELD(t1.namespace_id, 23, 89) < FIELD(t2.namespace_id, 23, 89)
WHERE t2.keyword IS NULL AND t2.args IS NULL
ORDER BY t1.ratio DESC;

Edit: It might also be worth having a look at some (commercial) DBMS supporting OLAP operations (if you're able to choose and if you are processing large sets of data). For Georg's case I think the OVER and PARTITION BY keywords would have helped: http://publib.boulder.ibm.com/infocenter/rbhelp/v6r3/topic/com.ibm.redbrick.doc6.3/sqlrg/sqlrg36.htm#sii06377181

Stefan Schramm
  • 144
  • 1
  • 3
0

I think this is what you want -

SELECT t1.* FROM tbl t1
  JOIN(SELECT keyword, args, MIN(ratio) ratio FROM tbl GROUP BY keyword, args) t2
    ON t1.keyword = t2.keyword AND t1.args = t2.args AND t1.ratio = t2.ratio
  ORDER BY ratio DESC;

and second one -

SELECT t1.* FROM tbl t1
  JOIN(SELECT keyword, args, MAX(ratio) ratio FROM tbl GROUP BY keyword, args) t2
    ON t1.keyword = t2.keyword AND t1.args = t2.args AND t1.ratio = t2.ratio
  ORDER BY ratio DESC;
Devart
  • 119,203
  • 23
  • 166
  • 186
  • 1
    How does this address a custom sort sequence for `namespace_id`? – JYelton Feb 23 '12 at 17:44
  • This actually brings the right results in both cases but apparently only by accident. After changing all rows with `keyword='en'` to `ratio=0.5`, it doesn't even group anymore. – Georg Jähnig Feb 23 '12 at 22:12
0

You could use the FIELD() to generate a custom sort sequence like this:

SELECT keyword, args, title, namespace_id, FIELD(namespace_id, 32, 89) sorting, ratio 
FROM tbl 
GROUP BY keyword, args
ORDER BY sorting DESC, ratio DESC;

Note that any namespace_id not specified in the FIELD() function will receive a 0 sorting value, so to get these items to appear first in the results, you must specify them in reverse order and use DESC for the sort order.

Thus FIELD(namespace_id, 32, 89) with ORDER BY sorting DESC will result in:

x x 89 x
x x 32 x
...

Whereas FIELD(namespace_id, 89, 32) with ORDER BY sorting DESC will result in:

x x 32 x
x x 89 x
...
JYelton
  • 35,664
  • 27
  • 132
  • 191
  • This will not help if he wants to sort against the natural ordering of the values in `namespace_id`, which, looking at his post, might be the case. – Hanno Fietz Feb 23 '12 at 21:02
  • `FIELD()` looked really promising and I just tried it out. However, it doesn't do what I want. The row with `title=B` never gets picked, no matter what order I put into `FIELD()`. And they're not ordered by `ratio` but first by `sorting`. – Georg Jähnig Feb 23 '12 at 21:46
0

One slightly hackish way to do this would be to order by a number of boolean expressions, like this:

SELECT keyword, args, title, namespace_id, ratio 
FROM tbl 
GROUP BY keyword, args 
ORDER BY namespace_id != 89,
         namespace_id != 23,
         namespace_id != x,
         namespace_id != y,
         ratio DESC;

Obviously, this becomes impractical very quickly.

If you really have to do this on the SQL side, I suggest that you create another table containing the columns namespace_id (which should have a UNIQUE constraint) and priority (or similar). Then you JOIN in that table and ORDER BY priority.

Chances are that you already have a table for the namespaces referred to by namespace_id. In that case, just add a priority column to that table.

Hanno Fietz
  • 30,799
  • 47
  • 148
  • 234
  • This does basically the same as using `FIELD`, just more hackish :). The problem still remains, how to tell `GROUP BY` which row to pick. `ORDER BY` apparently doesn't, just takes the output of `GROUP BY`. – Georg Jähnig Feb 23 '12 at 21:53
  • I see. I'm not sure if you can influence which rows remain after `GROUP`ing, at least I think `ORDER BY` is executed on the result of `GROUP BY`. There's "group functions", such as `MIN()` and `MAX()`, but off the bat, I'm not sure how they would help. – Hanno Fietz Feb 24 '12 at 09:26
0

After reconsidering your question and your comments on my previous answer, I think you just can't do this. Here's why:

Since you want to filter rows after ordering them, your only option would be the HAVING clause, which to my knowledge is the only thing that is processed after the ORDER BY clause.

Since the HAVING clause only looks at each row individually, but you want to filter by its relative position in the set (i. e. you want only the first row of each subset with identical keyword / arg), you would have to "smuggle" that as a value into each row. I briefly thought about this option, but could only come up with nonsense.

Skip the GROUPing, get the ORDER right (by associating a sortable priority value with each namespace_id), and then, in your code, use a data structure that's keyed to keyword and arg, and ignore duplicates when filling it with the result set.

If you for some arcane reason absolutely have to do everything in SQL, you can simulate what I described above by creating a temporary table (in memory) that has a UNIQUE constraint on the two columns and do INSERT IGNORE INTO temp_table SELECT ...


As a side note: Everytime you notice that you want SQL to do something to a row that is dependent on other rows in the result set, you're probably out of luck. I've repeatedly run into cases that I thought would be easy, until I noticed that I approached result sets as if they were loops. My favourite anecdote is the time where I tried desperately to subtract the value in one column from its counterpart in the preceding row. Sometimes you can try to hack your way out with weird JOINs (pulling the related rows together into one), but even if that works, it's probably expensive.

Hanno Fietz
  • 30,799
  • 47
  • 148
  • 234