Identify N maxes of a row, discarding the remaining bottom values from table rows

Question

I've got stuck on the following problem and was hoping for some help. I've tried a few things and have used some information found on Stack Overflow (such as this/How to apply max function for each row in KDB? and this/Iterate over current row values in kdb query, and flipping then sliding windows as per my previous post), and flicked through my Q for Mortals again but for some reason I have hit a brick wall.

In my table, the first column is a date column, the rest are numbers. From this I'm trying to generate a table that only has the n maximum numbers of a row left, the rest set to zero or 0N (or, if you like, where the m bottom values have been discarded).

Example:

Starting table:

q)t:([] td:2001.01.01 2001.01.02 2001.01.03 2001.01.04 2001.01.05 2001.01.06; 
AA:121.5 125.0 127.0 126.0 129.2 130.0; BB:111.0 115.3 117.0 116.0 119.2
120.0; CC:120.0 126.0 125.5 128.8 135.0 130.0; DD:120.1 123.3 128.4 128.3 
127.5 126.0; NN:122.0 125.5 126.0 116.0 109.0 100.5)

td         AA    BB    CC    DD    NN   
----------------------------------------
2001.01.01 121.5 111   120   120.1 122  
2001.01.02 125   115.3 126   123.3 125.5
2001.01.03 127   117   125.5 128.4 126  
2001.01.04 126   116   128.8 128.3 116  
2001.01.05 129.2 119.2 135   127.5 109  
2001.01.06 130   120   130   126   100.5

The desired end result when identifying the 2 maximums per row and blanking the rest (with either 0 or 0n):

td         AA    BB CC    DD    NN   
-------------------------------------
2001.01.01 121.5                122  
2001.01.02          126         125.5
2001.01.03 127            128.4      
2001.01.04          128.8 128.3      
2001.01.05 129.2    135              
2001.01.06 130      130

To take row 1 as an example, the top 2 values in AA and NN of that row have been left whilst the two others in BB and CC have been blanked out.

To get only the max value, i.e. the one top value, I can do the below and use the newly added column in a followup update statement. However, the problem here is that I need to find the n maxes and discard the rest.

q)update maxes:max(AA;BB;CC;DD;NN) from t

Not sure if it's of any interest, but as an example of what I have tried: If I use a tip from another stack overflow post and I execute that on the values themselves I can sort of get there, but not in a table format:

q)nthMax:{x (idesc x)[y-1]}
{x (idesc x)[y-1]}
q)nthMax[(121.5 111 120 120.1   122);1]
122f
q)nthMax[(121.5 111 120 120.1   122);2]
121.5

However when I try to use this as part of an update or select then it's not working; also it strikes me as a non-q approach so interested in what folks have to say about solving the above.

Another example was I tried flipping the table then use MMAX, however as the dates are at the top they "survive". Also, this seems a bit clunky maybe as I have to do this for n times per column if I'm interested in n maxes, or drop x numbers that form the bottom values, leaving n max numbers.

Kind regards, Sven

Although I don't know your use case here, in general if you end up in a situation where you have to perform logic/operations across *rows* (as opposed to columnar operations) then you're likely not setting up your data/tables optimally. This operation could more easily be performed (and more efficiently for large datasets) if the table had one column for values and one column for the associated headers (AA, BB, CC etc). If you really needed the output to be a sparse/wide table with blanks then you could pivot the result http://code.kx.com/q/cookbook/pivoting-tables/ — terrylynch, Jan 08 '18 at 19:28
I was wondering if this was an issue - I'll start organising the data like this going forwards as the datasets can be quite large (millions and millions) so speed and efficiency are of concern. Many thanks, I'll restructure the data at a later junction. — Sven F., Jan 08 '18 at 23:26

Paul Kerrigan · Answer 1 · 2018-01-08T17:20:29.703

If you don't need the columns to stay in the same order, the following will get you the correct result:

key[kt]!(uj/) value {enlist (2#idesc x)#x}each kt:1!t

Results in:

td        | NN    AA    CC    DD      
----------| -----------------------   
2001.01.01| 122   121.5               
2001.01.02| 125.5       126           
2001.01.03|       127         128.4   
2001.01.04|             128.8 128.3   
2001.01.05|       129.2 135           
2001.01.06|       130   130

You could fix the order afterwards with "xcols" if it's important to you, or do this (which is a little longer but preserves columns which are never in the top n)

q)key[kt]!(uj/) value {enlist (key[x]!count[x]#0n),(2#idesc x)#x}each kt:1!t

td        | AA    BB CC    DD    NN                                         
----------| --------------------------                                      
2001.01.01| 121.5                122                                        
2001.01.02|          126         125.5                                      
2001.01.03| 127            128.4                                            
2001.01.04|          128.8 128.3                                            
2001.01.05| 129.2    135                                                    
2001.01.06| 130      130

Many thanks Paul, that I don't have to reference the columns is very convenient. — Sven F., Jan 08 '18 at 23:35
It's not necessary to create a dummy dictionary with `(key[x]!count[x]#0n)`, you can achieve the same thing with `key[x]#` - see my answer below for an example of this. You can also lose the `enlist` and that should stop you needing to use `uj` when all cols are present anyway (last example) — Jonathon McMurray, Jan 09 '18 at 09:56

score 3 · Accepted Answer · answered Jan 09 '18 at 09:52

Here's another option, possibly a little tidier:

q)0!{key[x]#(2#idesc x)#x}'[1!t]
td         AA    BB CC    DD    NN
-------------------------------------
2001.01.01 121.5                122
2001.01.02          126         125.5
2001.01.03 127            128.4
2001.01.04          128.8 128.3
2001.01.05 129.2    135
2001.01.06 130      130

This works on the assumption that the first column is the only one you don't want to consider for the maximums. It's similar to the other two answers in it's use of idesc. One part to note here is key[x]# which essentially adds null entries to a dictionary to ensure all keys are present. As an example of this:

q)`a`b`c#`a`c!1 2
a| 1
b|
c| 2

Note how b is in the resultant dictionary but not in the original dictionary. This is used to make sure the dictionary generated for each line conforms with the others, thus resulting in a table (which after all, is just a list of conforming dictionaries).

This is indeed tidier and better – Paul Kerrigan Jan 09 '18 at 11:52 — Paul Kerrigan, Jan 09 '18 at 11:52

Thomas Smyth - Treliant · Answer 3 · 2018-01-08T17:22:56.027

1

Here is an ugly bit of code that should work for your example:

{x,'flip y!flip{?[x>idesc y;y;0N]}[z]each flip x y}[t;`AA`BB`CC`DD`NN;2]
td         AA    BB CC    DD    NN
-------------------------------------
2001.01.01 121.5                122
2001.01.02          126         125.5
2001.01.03 127            128.4
2001.01.04          128.8 128.3
2001.01.05 129.2    135
2001.01.06 130      130

The function allows you to specify which columns should be included and how values in each row.

edited Jan 08 '18 at 17:22

answered Jan 08 '18 at 17:15

Thomas Smyth - Treliant

4,993
6
25
36

1

Thanks again Thomas - I was hoping to avoid referencing the column names as they change and typically are about 20-100 of them. Nice to see some flip magic though, I did pick up something I need for another part of my model. – Sven F. Jan 08 '18 at 23:29
If you have a large number of columns, you could use "cols[t] except `td" in lieu of writing out the column names, since as you've said in the original question, you have one date column and the rest should be included. – Paul Kerrigan Jan 09 '18 at 09:09

Identify N maxes of a row, discarding the remaining bottom values from table rows

3 Answers3