1

This is what my data looks like. If I do the following query:

select * from gdax_trades where order_type='limit' limit 5;

I get a return that looks like this:

 row_id  |               order_id               |  price  | funds | maker_order_id | taker_order_id | trade_id | product_id |              client_oid              | reason | remaining_size |    size    |  sequence  | side |          time           | order_type | event_type 
---------+--------------------------------------+---------+-------+----------------+----------------+----------+------------+--------------------------------------+--------+----------------+------------+------------+------+-------------------------+------------+------------
 3697499 | 01d63a5b-a5b7-4153-b93d-bd18c249d9c3 | 4113.06 |       |                |                |          | BTC-USD    | 50028bab-81da-4842-98f0-2a1206669567 |        |                |       0.01 | 7446101470 | buy  | 2018-11-29 04:15:39.047 | limit      | received
 3697501 | 9295111b-2e23-445c-9f52-52d2f26fb418 | 4131.93 |       |                |                |          | BTC-USD    | de58f4a6-4577-4680-b083-df34ade6c001 |        |                | 0.12792387 | 7446101472 | sell | 2018-11-29 04:15:39.071 | limit      | received
 3697504 | 4c09878d-8bf9-49d7-9fc7-ca81b7da9e42 | 4131.19 |       |                |                |          | BTC-USD    | a55e0315-8b65-4525-a7a7-debcf6f17bb5 |        |                | 0.10898271 | 7446101475 | sell | 2018-11-29 04:15:39.155 | limit      | received
 3697506 | 0a157570-a811-420e-81ff-0ead9cc34984 | 4132.69 |       |                |                |          | BTC-USD    | 45086077-34be-441e-947f-99fe60bd88ef |        |                | 0.12146031 | 7446101477 | sell | 2018-11-29 04:15:39.24  | limit      | received
 3697508 | e8e1d02f-e627-4eac-a2e5-61c08399d6ef | 4117.83 |       |                |                |          | BTC-USD    | 00000000-818a-0006-0001-000011037107 |        |                |      0.001 | 7446101479 | sell | 2018-11-29 04:15:39.259 | limit      | received
(5 rows)

There are other rows in the table corresponding to each order_id but do not have order_type='limit'. For example, if I try to find all rows that correspond to the first order_id:

select * from gdax_trades where order_id='01d63a5b-a5b7-4153-b93d-bd18c249d9c3';

I get:

 row_id  |               order_id               |  price  | funds | maker_order_id | taker_order_id | trade_id | product_id |              client_oid              |  reason  | remaining_size | size |  sequence  | side |          time           | order_type | event_type 
---------+--------------------------------------+---------+-------+----------------+----------------+----------+------------+--------------------------------------+----------+----------------+------+------------+------+-------------------------+------------+------------
 3697499 | 01d63a5b-a5b7-4153-b93d-bd18c249d9c3 | 4113.06 |       |                |                |          | BTC-USD    | 50028bab-81da-4842-98f0-2a1206669567 |          |                | 0.01 | 7446101470 | buy  | 2018-11-29 04:15:39.047 | limit      | received
 3697500 | 01d63a5b-a5b7-4153-b93d-bd18c249d9c3 | 4113.06 |       |                |                |          | BTC-USD    |                                      |          |           0.01 |      | 7446101471 | buy  | 2018-11-29 04:15:39.047 |            | open
 3697662 | 01d63a5b-a5b7-4153-b93d-bd18c249d9c3 | 4113.06 |       |                |                |          | BTC-USD    |                                      | canceled |           0.01 |      | 7446101633 | buy  | 2018-11-29 04:15:40.522 |            | done
(3 rows)

What I want is a SQLAlchemy query that returns to me rows with an order_id corresponding to a "limit" order. I tried doing a self referential join:

GDAXTradeAlias = aliased(GDAXTrade)

orders = (
    sess
    .query(GDAXTrade)
    .filter( GDAXTrade.time.between(start_dt, end_dt) )
    .filter(GDAXTrade.order_type=='limit')
    .join(GDAXTradeAlias, GDAXTrade.order_id==GDAXTradeAlias.order_id)
    .filter( GDAXTrade.time.between(start_dt, end_dt) )
    .all() )

but that did not get me the desired result. Does anyone have any advice?

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
mdornfe1
  • 1,982
  • 1
  • 24
  • 42
  • Postgres version and table definition (`CREATE TABLE` statement) would be helpful (while the many unrelated columns in the sample output are not). – Erwin Brandstetter Dec 02 '18 at 01:21

2 Answers2

1

Many ways possible. I suggest an EXISTS semi-join. Probably fastest an very clear to read:

SELECT *
FROM   gdax_trades g
WHERE  EXISTS (
   SELECT FROM gdax_trades
   WHERE  order_type = 'limit'
   AND    order_id = g.order_id
   );

The SELECT list of an EXISTS expression can stay empty. Only the existence of at least one row is relevant.

We need at least one table alias (g in the example) when addressing the same table twice. I did not table-qualify columns referring to the local table inside the subquery, since that is visible first. Only qualified the reference to the outer query as g.order_id. This is the minimum requirement to be unambiguous. You can be more explicit if you want.

This includes "limit" orders in the result. You can easily exclude them by adding a final:

...
WHERE order_type IS DISTINCT FROM 'limit'

IS DISTINCT FROM because order_type seems to be nullable (unclear whether those are '' or NULL in your sample result ). WHERE order_type <> 'limit' would exclude rows with order_type IS NULL.

The query returns unique rows from the outer table, even if there are multiple "limit" orders with the same order_id. Various alternative query techniques with joins or subqueries return duplicates in that case. Related:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Thanks for the response! I also posted an answer using a subquery. I'm curious if you think your solution would be more performant? – mdornfe1 Dec 02 '18 at 01:10
  • 1
    `.between(start_dt, end_dt)` aside, there is a functional difference in principle: If your subquery finds more than 1 "limit" order with the same `order_id`, you get duplicates rows from the join, while `EXISTS` returns unique rows - which is also the reason why it's potentially faster: it can stop looking as soon a single row has been found for the same `order_id`. If `order_id` is unique among "limit" orders, there won't be much difference. I would pick `EXISTS` for your case. – Erwin Brandstetter Dec 02 '18 at 01:14
0

I found an answer using sub queries. I'm curious what people think of this

    sub_query = ( 
        sess
        .query(GDAXTrade)
        .filter( GDAXTrade.time.between(start_dt, end_dt) )
        .filter(GDAXTrade.order_type=='limit')
        .subquery() 
        )

    orders = (
        sess
        .query(GDAXTrade)
        .join(sub_query, GDAXTrade.order_id==sub_query.c.order_id, isouter=True)
        .filter(GDAXTrade.order_id==sub_query.c.order_id)
        .filter( GDAXTrade.time.between(start_dt, end_dt) )
        .order_by(GDAXTrade.time.asc())
        .all()
        )
mdornfe1
  • 1,982
  • 1
  • 24
  • 42