How do I efficiently execute large queries?

Question

Consider the following demo schema

trades:([]symbol:`$();ccy:`$();arrivalTime:`datetime$();tradeDate:`date$(); price:`float$();nominal:`float$());
marketPrices:([]sym:`$();dateTime:`datetime$();price:`float$());
usdRates:([]currency$();dateTime:`datetime$();fxRate:`float$());

I want to write a query that gets the price, translated into USD, at the soonest possible time after arrivalTime. My beginner way of doing this has been to create intermediate tables that do some filtering and translating column names to be consistent and then using aj and ajo to join them up.

In this case there would only be 2 intermediate tables. In my actual case there are necessarily 7 intermediate tables and records counts, while not large by KDB standards, are not small either.

What is considered best practice for queries like this? It seems to me that creating all these intermediate tables is resource hungry. An alternative to the intermediate tables is 2 have a very complicated looking single query. Would that actually help things? Or is this consumption of resources just the price to pay?

Can you give a numerical example of what you're trying to achieve? Just seeing the schema isn't enough to give an optimal answer in my view. Ideally if you could provide a small example data and what you expect the output to be — terrylynch, Nov 11 '21 at 09:58

SeanHehir · Accepted Answer · 2021-11-11T08:05:24.770

For joining to the next closest time after an event take a look at this question: KDB reverse asof join (aj) ie on next quote instead of previous one

Assuming that's what your looking for then you should be able to perform your price calculation either before or after the join (depending on the size of your tables it may be faster to do it after). Ultimately I think you will need two (potentially modified as per above) aj's (rates to marketdata, marketdata to trades).

If that's not what you're looking for then I could give some more specifics although some sample data would be useful.

score 3 · Answer 2 · answered Nov 10 '21 at 22:08

My thoughts:

The more verbose/readible your code, the better for you to debug later and any future readers/users of your code.
Unless absolutely necessary, I would try and avoid creating 7 copies of the same table. If you are dealing with large tables memory could quickly become a concern. Particularly if the processing takes a long time, you could be creating large memory spikes. I try to keep to updating 1-2 variables at different stages e.g.:

res: select from trades;
res:aj[`ccy`arrivalTime;
  res;
  select ccy:currency, arrivalTime:dateTime, fxRate from usdRates 
  ]
res:update someFunc fxRate from res;

Sean beat me to it, but aj for a time after/ reverse aj is relatively straight forward by switching bin to binr in the k code. See the suggested answer.
I'm not sure why you need 7 intermediary tables unless you are possibly calculating cross rates? In this case I would typically join ccy1 and ccy2 with 2 ajs to the same table and take it from there.
Although it may be unavoidable in your case if you have no control over the source data, similar column names / greater consistency across schemas is generally better. e.g. sym vs symbol

How do I efficiently execute large queries?

2 Answers2