1

I have a specific request to do on my database (PostgreSQL v9.4.5), and I don't see any elegant solution in pure SQL to solve it (I know I can do it using Python or other, but I have several billions lines of data, and the calculation time would be greatly increased).

I have two tables : trades and events. These tables both represent the trades occurring in an orderbook during a day (this is why I have several billions lines, my data is over several years) but there are many more events than trades.

Both tables have columns time, volume and quantity, however each one has other columns (let's say respectively foo and bar) with specific information. I want to make a correspondence between the two tables on the columns time, volume and price, as I know this correspondence exists as an injection from trades to events (if there are n rows in trades with the same time t, the same price p and the same volume v, I know there are also n rows in events with the time t, the price p and the volume v).

Trades :

  id |   time    |  price  | volume |   foo
-----+-----------+---------+--------+-------
 201 | 32400.524 |      53 |   2085 |   xxx
 202 | 32400.530 |      53 |   1162 |   xxx
 203 | 32400.531 |   52.99 |     50 |   xxx
 204 | 32400.532 |   52.91 |   3119 |   xxx
 205 | 32400.837 |   52.91 |   3119 |   xxx <--
 206 | 32400.837 |   52.91 |   3119 |   xxx <--
 207 | 32400.837 |   52.91 |   3119 |   xxx <--
 208 | 32400.839 |   52.92 |   3220 |   xxx <--
 209 | 32400.839 |   52.92 |   3220 |   xxx <--
 210 | 32400.839 |   52.92 |   3220 |   xxx <--

Events :

  id |   time    |  price  | volume |  bar 
-----+-----------+---------+--------+------
 328 | 32400.835 |   52.91 |   3119 |  yyy
 329 | 32400.837 |   52.91 |   3119 |  yyy <--
 330 | 32400.837 |   52.91 |   3119 |  yyy <--
 331 | 32400.837 |   52.91 |   3119 |  yyy <--
 332 | 32400.838 |   52.91 |   3119 |  yyy
 333 | 32400.838 |   52.91 |   3119 |  yyy
 334 | 32400.839 |   52.92 |   3220 |  yyy <--
 335 | 32400.839 |   52.92 |   3220 |  yyy <--
 336 | 32400.839 |   52.92 |   3220 |  yyy <--
 337 | 32400.840 |   52.91 |   2501 |  yyy

What I want is :

   time    |  price  | volume |  bar |   foo 
-----------+---------+--------+------+-------
 32400.837 |   52.91 |   3119 |  xxx |   yyy
 32400.837 |   52.91 |   3119 |  xxx |   yyy
 32400.837 |   52.91 |   3119 |  xxx |   yyy
 32400.839 |   52.92 |   3220 |  xxx |   yyy
 32400.839 |   52.92 |   3220 |  xxx |   yyy
 32400.839 |   52.92 |   3220 |  xxx |   yyy

I cannot do a classic INNER JOIN, or else I will have all the possible crossing between the two tables (in this case I would have 6x6 then 36 rows).

The though thing is to have only one row versus one row, although several rows could fit.

Thank you for your help.

EDIT :

As I said, if I use a classic INNER JOIN, for example

SELECT * FROM events e,
INNER JOIN trades t
ON t.time = e.time AND t.price = e.price AND t.volume = e.volume

I will have something like :

trade_id | event_id |   time    |  price  | volume |  bar |   foo 
---------+----------+-----------+---------+--------+------+-------
  205    |   329    | 32400.837 |   52.91 |   3119 |  xxx |   yyy
  205    |   330    | 32400.837 |   52.91 |   3119 |  xxx |   yyy
  205    |   331    | 32400.837 |   52.91 |   3119 |  xxx |   yyy
  206    |   329    | 32400.837 |   52.91 |   3119 |  xxx |   yyy
  206    |   330    | 32400.837 |   52.91 |   3119 |  xxx |   yyy
  206    |   331    | 32400.837 |   52.91 |   3119 |  xxx |   yyy
  207    |   329    | 32400.839 |   52.91 |   3119 |  xxx |   yyy
  207    |   330    | 32400.839 |   52.91 |   3119 |  xxx |   yyy
  207    |   331    | 32400.839 |   52.91 |   3119 |  xxx |   yyy
  208    |   334    | 32400.837 |   52.92 |   3220 |  xxx |   yyy
  208    |   335    | 32400.837 |   52.92 |   3220 |  xxx |   yyy
  208    |   336    | 32400.837 |   52.92 |   3220 |  xxx |   yyy
  209    |   334    | 32400.837 |   52.92 |   3220 |  xxx |   yyy
  209    |   335    | 32400.837 |   52.92 |   3220 |  xxx |   yyy
  209    |   336    | 32400.837 |   52.92 |   3220 |  xxx |   yyy
  210    |   334    | 32400.839 |   52.92 |   3220 |  xxx |   yyy
  210    |   335    | 32400.839 |   52.92 |   3220 |  xxx |   yyy
  210    |   336    | 32400.839 |   52.92 |   3220 |  xxx |   yyy

But what I want is :

trade_id | event_id |   time    |  price  | volume |  bar |   foo 
---------+----------+-----------+---------+--------+------+-------
  205    |   329    | 32400.837 |   52.91 |   3119 |  xxx |   yyy
  206    |   330    | 32400.837 |   52.91 |   3119 |  xxx |   yyy
  207    |   331    | 32400.839 |   52.91 |   3119 |  xxx |   yyy
  208    |   334    | 32400.837 |   52.92 |   3220 |  xxx |   yyy
  209    |   335    | 32400.837 |   52.92 |   3220 |  xxx |   yyy
  210    |   336    | 32400.839 |   52.92 |   3220 |  xxx |   yyy
Edouard Berthe
  • 1,393
  • 2
  • 18
  • 38
  • in your on part you can use give more than one condition like: e.time=t.time and e.price=t.price and e.volume=t.volume – Mauri Nov 09 '15 at 10:57
  • You'll have to decide which records you want, which values for `foo` and `bar`. There are 36 combinations, as you say, all of these are technically correct. You'll have to tell sql which are logically correct, and use subqueries to select the right records to join – HoneyBadger Nov 09 '15 at 10:58
  • Generate a row_number over partition by your columns on both `Trades` and `Events` and then include this column in the inner join condition. – B0Andrew Nov 09 '15 at 11:07
  • @HoneyBadger I cannot do that because foo and bar are what I want. I don't know these values a priori – Edouard Berthe Nov 09 '15 at 11:13
  • @MohamedElHousseine Ok, could you give me an example of what I could do in the ON clause to arrive to this result ? – Edouard Berthe Nov 09 '15 at 11:14
  • which db/version are you using? – Utsav Nov 09 '15 at 11:16
  • @Utsav PostgreSQL v9.4.5 – Edouard Berthe Nov 09 '15 at 11:17
  • @Edourardb only use distict: SELECT distinct trade_id, event_id, time, price, volum, bar, foo FROM events e, INNER JOIN trades t ON t.time = e.time AND t.price = e.price AND t.volume = e.volume – Mauri Nov 09 '15 at 11:22
  • @MohamedElHousseine ok thank you, let me check that ! :) – Edouard Berthe Nov 09 '15 at 11:26
  • @MohamedElHousseine It doesn't work, it's displaying the same thing as if the DISTINCT weren't here.. – Edouard Berthe Nov 09 '15 at 11:30
  • @Edouardb try this please: SELECT * FROM events e, INNER JOIN (SELECT distinct trade_id, time, price, volum, bar FROM Trades) t ON t.time = e.time AND t.price = e.price AND t.volume = e.volume – Mauri Nov 09 '15 at 11:42
  • @MohamedElHousseine Ok, I'm trying ty – Edouard Berthe Nov 09 '15 at 11:52
  • Please rephrase "The [tough] thing is to have only one row versus one row, although several rows could fit." Use enough words and refer to enough things clearly. – philipxy Nov 09 '15 at 23:57

5 Answers5

0

Check this query -

SELECT Events.*,Trades.*
FROM Events
INNER JOIN Trades
ON Trades.time = Events.time
AND Trades.price = Events.price
AND Trades.volume = Events.volume
Edouard Berthe
  • 1,393
  • 2
  • 18
  • 38
mugzi
  • 809
  • 4
  • 16
  • 33
  • As I said, if I do this I will have too many correspondences. In fact, it's clearer if I display the Ids, I will edit my question. – Edouard Berthe Nov 09 '15 at 11:05
  • try partition method to get top row for each - example please check here -http://stackoverflow.com/questions/6841605/get-top-1-row-of-each-group – mugzi Nov 09 '15 at 11:18
0

Try this and let me know if. We can also you row_number() over(partion by) clause but I am not sure if it will work on postgreSQL. Anyways try this.

SELECT 
  min(t.id) as trade_id,min(e.id) as event_id,
  min(t.time) as time,min(t.price) as price,
  min(t.volume) as volume,  min(e.bar) as bar,
  min(t.foo) as foo 
FROM events e,
  INNER JOIN trades t
ON t.time = e.time AND t.price = e.price AND t.volume = e.volume
group by t.id
Utsav
  • 7,914
  • 2
  • 17
  • 38
  • Ok, let me check that ! Thank you :) – Edouard Berthe Nov 09 '15 at 11:31
  • Hmm, it doesn't work. It's not grouping by event (one trade is linked to all the events which fit). I'm having a look on row_number and over(partition by). – Edouard Berthe Nov 09 '15 at 11:39
  • If the SQL query you gave works, then min should work too. Anyways, can you please use http://sqlfiddle.com and recreate the tables and you first select query and share the link? We can try something on that. – Utsav Nov 09 '15 at 11:43
0

Just looking at the sample data you have provided, one option would be:

SELECT e.id, min(t.id), e.time, e.price, e.volume, min(e.bar), min(t.foo)  FROM events e,
INNER JOIN trades t
ON t.time = e.time AND t.price = e.price AND t.volume = e.volume
GROUP BY e.id, e.time, e.price, e.volume
Verma
  • 956
  • 6
  • 21
0

Here is my example with row_number.

Also, SQL Fiddle: SO 33608351

with 
trades AS
(
    select 201 as id, 32400.524 as time, 53 as price,       2085 as volume, 'xxx' as foo union all
    select 202, 32400.530, 53,      1162,   'xxx' union all
    select 203, 32400.531, 52.99,       50,     'xxx' union all
    select 204, 32400.532, 52.91,       3119,   'xxx' union all
    select 205, 32400.837, 52.91,       3119,   'xxx' union all
    select 206, 32400.837, 52.91,       3119,   'xxx' union all
    select 207, 32400.837, 52.91,       3119,   'xxx' union all
    select 208, 32400.839, 52.92,       3220,   'xxx' union all
    select 209, 32400.839, 52.92,       3220,   'xxx' union all
    select 210, 32400.839, 52.92,       3220,   'xxx'
),
events as
(
    select 328 as id, 32400.835 as time ,   52.91 as price ,   3119 as volume ,  'yyy' as bar union all
    select 329 , 32400.837 ,   52.91 ,   3119 ,  'yyy' union all
    select 330 , 32400.837 ,   52.91 ,   3119 ,  'yyy' union all
    select 331 , 32400.837 ,   52.91 ,   3119 ,  'yyy' union all
    select 332 , 32400.838 ,   52.91 ,   3119 ,  'yyy' union all
    select 333 , 32400.838 ,   52.91 ,   3119 ,  'yyy' union all
    select 334 , 32400.839 ,   52.92 ,   3220 ,  'yyy' union all
    select 335 , 32400.839 ,   52.92 ,   3220 ,  'yyy' union all
    select 336 , 32400.839 ,   52.92 ,   3220 ,  'yyy' union all
    select 337 , 32400.840 ,   52.91 ,   2501 ,  'yyy'
),
tradesWithRowNumber AS
(
    select   *
            ,ROW_NUMBER() over (PARTITION by time, price, volume order by time, price, volume) as RowNum
    from trades
),
eventsWithRowNumber AS
(
    select   *
            ,ROW_NUMBER() over (PARTITION by time, price, volume order by time, price, volume) as RowNum
    from events
)
select  t.time,
        t.price,
        t.volume,
        t.foo,
        e.bar
FROM    tradesWithRowNumber t
        inner JOIN
        eventsWithRowNumber e   on  e.time = t.time
                                AND e.price = t.price
                                AND e.volume = t.volume
                                and e.RowNum = t.RowNum
B0Andrew
  • 1,725
  • 13
  • 19
  • OK. Note that the first two CTEs are only added as replacements for your original tables in order to make this query stand-alone. – B0Andrew Nov 09 '15 at 12:06
  • Yes, I had understood concerning the two CTEs ^^ It's working perfectly thank you again. To be honest I don't really get how it's working so I'm having a look at the "row_number" and 'partition' documentations until I get it ! – Edouard Berthe Nov 09 '15 at 12:09
0

If I understand correctly, you just want to list the foo and bar columns without creating a Cartesian product. For this purpose, you can introduce a new column using row_number() and join on that:

SELECT *
FROM (SELECT e.*,
             ROW_NUMBER() OVER (PARTITION BY time, price, volume ORDER BY id) as seqnum
      FROM events e
     ) e INNER JOIN
     (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY time, price, volume ORDER BY id) as       FROM trades t
seqnum
     ) t
     ON t.time = e.time AND t.price = e.price AND t.volume = e.volume AND
        t.seqnum = e.seqnum;

Your question is unclear on whether you want an inner join, left outer join, or full outer join.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786