1

I am building an app where the server needs to select rows based on some criteria/filters. One of them is the location of the user and the radius at which the user want's to see posts and other filters such date range and filter for a value of another column. This is going to be for an ad-hoc event discovery app.

I have read about PostGIS and I know there is a point datatype. Based on this answer I understood that it is better to order from equality to range columns, even though I feel like geo point column should be the first. But the main question is, how is it possible to create such an index? I thought about GiST index but not sure that helps.

Suppose the following simplified events table (disregard the validity position data):

id  event_title                  event_position   event_type  is_public  start_date
    (varchar)                    (point lat/lon)  (smallint)  (boolean)  (timestamptz)
--  ---------------------------  ---------------  ---------   ---------  ----
 1  "John's Party"               (122,35)         0           0          2020-07-05
 2  "Revolution then Starbucks"  (123,30)         1           1          2020-07-06
 3  "Study for math exam"        (120,36)         2           1          2020-07-07
 4  "Party after exam"           (120,36)         1           1          2020-07-08
 5  "Hiking next to the city"    (95,40)          3           1          2020-07-09
 6  "Football match"             (-42,31)         4           1          2020-07-10

So in this table a user would be able to query public events close to (122,34) by 100km (suppose first three rows fall into this area) and of event types 0, 1 or 2 falling between dates 2020-07-05 and 2020-07-07. The user would get the rows with ID 2 and 3.

This is the query I want to optimize with an appropriate index. Thank you!

Najib
  • 199
  • 1
  • 8
  • You have to describe what exactly you want to do, it is not clear to me. With PostGIS, you don't use `point`, but `geometry` or `geography`. And yes, it's going to be a GiST index. – Laurenz Albe Jul 03 '20 at 19:00
  • @Laurenz Albe I know, point is native and geometry is from PostGIS. I think you understood it well. There is one query I am not sure how to optimize and it is this query I described in the beginning: retrieving rows that match rows where geo point is within a distance with another point, and another column falls into a range and another column is in a set of possible values. – Najib Jul 03 '20 at 21:22
  • Your question seems to be quite relevant. However, it's not very clear to me. Please include a few rows of sample data, and an example of how do you want to use it. – The Impaler Jul 03 '20 at 22:01
  • @Najib Well... will you add an example? – The Impaler Jul 03 '20 at 22:26
  • @TheImpaler done! – Najib Jul 03 '20 at 22:31
  • You don't need an index for a table with 6 rows. And 6 rows is not enough to give us the flavor of what your real data is. Could you come up with a random number generator or something which generates vaguely realistic data? – jjanes Jul 04 '20 at 14:26
  • @jjanes thanks for the suggestion but you must be kidding me. Obviously there will be more than 6 rows and I think the "flavor" is pretty easily seen. It's an event table. As for a "random" data generator I am not sure what you are asking for. Title can be anything, doesn't matter (hope this is obvious). For event_position latitude longitude coordinates on earth (that we both live on). Event type let's say could be from 0 to 10, is_public I hope is obvious, and start_date well again, human calendar dates from now on. You could image 10000 rows, I hope you agree then it needs an index. :) – Najib Jul 04 '20 at 14:48
  • It might not have been obvious, but I don't have any real data yet. I am only designing. – Najib Jul 04 '20 at 15:00
  • It takes 2ms to scan 10,000 points with no index, so no, not obvious at all that you even need an index. The best index will depend on how selective each part of your WHERE clause, which we can't know from this. But `point` represents a spot on a 2D-plane, so can't accurately represent lat and lon on the surface of the whole earth, only smallish chunks of it. Between 2020-07-05 and 2020-07-07 is quite different if all your events are all in the next week, versus spanning the next decade or century. – jjanes Jul 04 '20 at 16:14
  • @jjanes Okay, sorry. I am sure I will need an index, because otherwise it might still not take too long but I want it ti be as fast as possible and based on my experience when there are thousands of rows an index definitely helps. As for ``point`` you are totally right, I might need to use geography type but that I think I will be able to decide, total accuracy might not be important. As for the range constraints they will never be in the past, usually until next week, at most from today until end of next month. – Najib Jul 04 '20 at 17:05
  • @jjanes I have to clarify about the date range: the records start from the launch of the software, but the records the users will need are from current date until next week, at most next months – Najib Jul 04 '20 at 17:19
  • @jjanes so sure the records might span a decade if my app runs for a decade.. – Najib Jul 04 '20 at 17:20
  • @jjanes maybe I could separate future and past events into different tables actually overnight, but Honestly, we don't need to get into it in such details! – Najib Jul 04 '20 at 17:22
  • Perhaps my question is an XY problem, I am not sure by now.. – Najib Jul 04 '20 at 17:24

1 Answers1

2

With the help of the btree_gist extension, you can include the event_type and start_date columns into the GiST index along with the event_position. However, the event_type won't be very useful in the index as long as the restriction clause is something like event_type in (0, 1, 2). (But if the list only contains one element, it will be rewritten into an equality, and in that case it can use that column in the index efficiently.) So using the other two columns would be my starting point. I would put the usually more selective one first. If you are going to leave past events in the table rather than clearing them out, then the date is probably going to eventually be the more selective one.

If the location condition is not very selective at all (e.g. most of your events are in NYC, and most of your users are in NYC, and almost everything is within 100km of almost everything else), then you might want a different approach. Just make a default BTREE index on (event_type, start_date). Unlike GiST, such a BTREE can effectively use a condition like event_type in (0, 1, 2) along with AND start_date between x and y.

I don't think that there is a fundamental reason GiST can't use the in-list effectively while BTREE can. Maybe that will be fixed for GiST in some future version.

jjanes
  • 37,812
  • 5
  • 27
  • 34
  • Wow, thank you very much! I did feel it had to be an index that uses both GiST and B-Tree for the appropriate fields, but I wouldn't have found this extension. As for your suggestions I understand completely. I was actually doubting there is a solution to create an index that indexes 2D, integer and array types, eventhough as you said theoretically it should be possible, but needs implementation (and I am not the one yet to do it). I am thinking to use bitwising on the resulting rows for the main event types/categories or use multiple category columns. This is exactly what I need, Thanks! – Najib Jul 05 '20 at 17:53
  • I will generate some data and post my results – Najib Jul 05 '20 at 18:05