
I'm trying to write a stored procedure that selects X well-spread points in time from a big table.

I have a table points:

  "Userid" integer
, "Time"   timestamp with time zone
, "Value"  integer 

It contains hundreds of millions of records, and about a million records per user.

I want to select X points (let's say 50) that are all well spread from time A to time B. The problem is that the points are not spread equally (if one point is at 6:00:00, the next point may come after 15 seconds, 20 seconds, or 4 minutes, for example).

Selecting all the points for an id can take up to 60 seconds (because there are about a million points).

Is there any way to select the exact number of points I desire, as well spread as possible, in a fast way?

Sample data:

   +--------+---------------------+-------+
   | UserId | Time                | Value |
   +--------+---------------------+-------+
1  | 1      | 2017-04-10 14:00:00 | 1     |
2  | 1      | 2017-04-10 14:00:10 | 10    |
3  | 1      | 2017-04-10 14:00:20 | 32    |
4  | 1      | 2017-04-10 14:00:35 | 80    |
5  | 1      | 2017-04-10 14:00:58 | 101   |
6  | 1      | 2017-04-10 14:01:00 | 203   |
7  | 1      | 2017-04-10 14:01:30 | 204   |
8  | 1      | 2017-04-10 14:01:40 | 205   |
9  | 1      | 2017-04-10 14:02:02 | 32    |
10 | 1      | 2017-04-10 14:02:15 | 7     |
11 | 1      | 2017-04-10 14:02:30 | 900   |
12 | 1      | 2017-04-10 14:02:45 | 22    |
13 | 1      | 2017-04-10 14:03:00 | 34    |
14 | 1      | 2017-04-10 14:03:30 | 54    |
15 | 1      | 2017-04-10 14:04:00 | 54    |
16 | 1      | 2017-04-10 14:06:00 | 60    |
17 | 1      | 2017-04-10 14:07:20 | 654   |
18 | 1      | 2017-04-10 14:07:40 | 32    |
19 | 1      | 2017-04-10 14:08:00 | 33    |
20 | 1      | 2017-04-10 14:08:12 | 32    |
21 | 1      | 2017-04-10 14:10:00 | 8     |
   +--------+---------------------+-------+

I want to select 11 "best" points from the list above, for the user with Id 1, from time 2017-04-10 14:00:00 to 2017-04-10 14:10:00.

Currently it's done on the server, after selecting all the points for the user. I calculate the "best times" by dividing the time range evenly, which gives a list such as 14:00:00, 14:01:00, ..., 14:10:00 (11 "best times", matching the number of points). Then I look for the closest point to each "best time" that has not been selected yet. The result will be points: 1, 6, 9, 13, 15, 16, 17, 18, 19, 20, 21
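
For illustration, the list of "best times" itself is easy to produce with generate_series() (a sketch; the step is the range divided by points minus 1):

SELECT generate_series(timestamptz '2017-04-10 14:00:00'
                     , timestamptz '2017-04-10 14:10:00'
                     , (timestamptz '2017-04-10 14:10:00'
                      - timestamptz '2017-04-10 14:00:00') / 10  -- 10 steps for 11 points
       ) AS best_time;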

Edit:

I'm trying something like this:

SELECT * FROM "points"
WHERE "Userid" = 1 AND
(("Time" =
(SELECT "Time" FROM 
"points"
ORDER BY abs(extract(epoch from '2017-04-10 14:00:00' - "Time"))
LIMIT 1)) OR
("Time" =
(SELECT "Time" FROM 
"points"
ORDER BY abs(extract(epoch from '2017-04-10 14:01:00' - "Time"))
LIMIT 1)) OR
("Time" =
(SELECT "Time" FROM 
"points"
ORDER BY abs(extract(epoch from '2017-04-10 14:02:00' - "Time"))
LIMIT 1)))

The problems here are that:
A) It doesn't take into account points that have already been selected.
B) Because of the ORDER BY, each additional time increases the running time of the query by ~1 second, and with 50 points I'm back at the 1-minute mark.

andrey
  • Sample data, desired results, and your current code would really help. – Gordon Linoff May 10 '17 at 12:20
  • It has been done sir. – andrey May 10 '17 at 13:21
  • Because their times are too close to the first point, and in order to get 11 best (well-spread) points, I need a point as close as possible to the times 14:00, 14:01, 14:02, 14:03, ..., 14:10 – andrey May 10 '17 at 13:58
  • The `"best" points` are still not well defined. "Well spread" is a fuzzy term open to interpretation. What you implement finds the closest points to a given time grid, which seems like a good approximation, but it does *not* necessarily minimize the sum of time differences between selected rows. And your version of Postgres should *always* be disclosed. – Erwin Brandstetter May 15 '17 at 11:31

2 Answers


There is an optimization problem behind your question that's hard to solve with just SQL.

That said, your attempted approximation can be implemented to use an index and show good performance regardless of table size. You need this index if you don't have it already:

CREATE INDEX ON points ("Userid", "Time");

Query:

SELECT *
FROM   generate_series(timestamptz '2017-04-10 14:00:00+0'
                     , timestamptz '2017-04-10 14:09:00+0'  -- 1 min *before* end!
                     , interval    '1 minute') grid(t)
LEFT  JOIN LATERAL (
   SELECT *
   FROM   points
   WHERE  "Userid" = 1
   AND    "Time" >= grid.t
   AND    "Time" <  grid.t + interval '1 minute'  -- same interval
   ORDER  BY "Time"
   LIMIT  1
   ) t ON true;

dbfiddle here

Most importantly, the rewritten query can use the above index and will be very fast, solving problem B).

It also addresses problem A) to some extent as no point is returned more than once. If there is no row between two adjacent points in the grid, you get no row in the result. Using LEFT JOIN .. ON true keeps all grid rows and appends NULL in this case. Eliminate those NULL rows by switching to CROSS JOIN. You may get fewer result rows this way.
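
A sketch of the same variant with empty grid slots dropped; only the join type changes:

SELECT *
FROM   generate_series(timestamptz '2017-04-10 14:00:00+0'
                     , timestamptz '2017-04-10 14:09:00+0'
                     , interval    '1 minute') grid(t)
CROSS  JOIN LATERAL (
   SELECT *
   FROM   points
   WHERE  "Userid" = 1
   AND    "Time" >= grid.t
   AND    "Time" <  grid.t + interval '1 minute'
   ORDER  BY "Time"
   LIMIT  1
   ) t;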

I am only searching ahead of each grid point. You might append a second LATERAL join to also search behind each grid point (just another index scan), and take the closer of the two results (ignoring NULL). But that introduces new problems (see the sketch after this list):

  • If one match is behind and the next is ahead, the gap widens.
  • You need special treatment for the lower and/or upper bound of the outer interval.
  • And you need two LATERAL joins with two index scans.
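
A minimal sketch of that two-sided variant, assuming the same table and index; it does not handle duplicates or the boundary cases above:

SELECT grid.t
     , CASE
          WHEN behind."Time" IS NULL THEN ahead."Time"
          WHEN ahead."Time"  IS NULL THEN behind."Time"
          WHEN ahead."Time" - grid.t <= grid.t - behind."Time"
               THEN ahead."Time"
          ELSE behind."Time"
       END AS closest_time
FROM   generate_series(timestamptz '2017-04-10 14:00:00+0'
                     , timestamptz '2017-04-10 14:10:00+0'
                     , interval    '1 minute') grid(t)
LEFT   JOIN LATERAL (          -- closest point at or after the grid point
   SELECT "Time"
   FROM   points
   WHERE  "Userid" = 1
   AND    "Time" >= grid.t
   ORDER  BY "Time"
   LIMIT  1
   ) ahead  ON true
LEFT   JOIN LATERAL (          -- closest point before the grid point
   SELECT "Time"
   FROM   points
   WHERE  "Userid" = 1
   AND    "Time" <  grid.t
   ORDER  BY "Time" DESC
   LIMIT  1
   ) behind ON true;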

You could use a recursive CTE to search 1 minute ahead of the last time actually found, but then the total number of rows returned varies even more.
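
A minimal sketch of that idea, assuming the same table and index; each step jumps to the first point at least 1 minute after the previously found one:

WITH RECURSIVE walk AS (
   (SELECT "Time", "Value"          -- first point in the outer interval
    FROM   points
    WHERE  "Userid" = 1
    AND    "Time" >= timestamptz '2017-04-10 14:00:00+0'
    ORDER  BY "Time"
    LIMIT  1)
   UNION ALL
   SELECT p."Time", p."Value"
   FROM   walk w
   CROSS  JOIN LATERAL (            -- next point >= 1 minute later
      SELECT "Time", "Value"
      FROM   points
      WHERE  "Userid" = 1
      AND    "Time" >= w."Time" + interval '1 minute'
      AND    "Time" <= timestamptz '2017-04-10 14:10:00+0'
      ORDER  BY "Time"
      LIMIT  1
      ) p
   )
SELECT * FROM walk;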

It all comes down to an exact definition of what you need, and where compromises are allowed.


Erwin Brandstetter
  • Hi Erwin, can you further explain the `LIMIT 1` part of the query? I feel like the tables `points` and `grid` are both ordered; `grid` pipes rows to `points` one by one for evaluation, and once a row meets the criteria in `points`, `grid` pipes the next row. It feels like a recursive query searching for a per-group min/max value. – jian Dec 07 '22 at 10:00
  • @jian: Find ample explanation in the linked answers. Most exhaustively in [the last one](https://stackoverflow.com/a/25536748/939860). I added one more link to explain the nature of `LATERAL` joins. – Erwin Brandstetter Dec 07 '22 at 12:04

This answer uses generate_series('2017-04-10 14:00:00','2017-04-10 14:10:00','1 minute'::interval) and a join for comparison.

To save others time setting up the data set:

t=# create table points(i int,"UserId" int,"Time" timestamp(0), "Value" int,b text);
CREATE TABLE
Time: 13.728 ms
t=# copy points from stdin delimiter '|';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
1  | 1      | 2017-04-10 14:00:00 | 1     |
2  | 1      | 2017-04-10 14:00:10 | 10    |
3  | 1      | 2017-04-10 14:00:20 | 32    |
4  | 1      | 2017-04-10 14:00:35 | 80    |
5  | 1      | 2017-04-10 14:00:58 | 101   |
6  | 1      | 2017-04-10 14:01:00 | 203   |
7  | 1      | 2017-04-10 14:01:30 | 204   |
8  | 1      | 2017-04-10 14:01:40 | 205   |
9  | 1      | 2017-04-10 14:02:02 | 32    |
10 | 1      | 2017-04-10 14:02:15 | 7     |
11 | 1      | 2017-04-10 14:02:30 | 900   |
12 | 1      | 2017-04-10 14:02:45 | 22    |
13 | 1      | 2017-04-10 14:03:00 | 34    |
14 | 1      | 2017-04-10 14:03:30 | 54    |
15 | 1      | 2017-04-10 14:04:00 | 54    |
16 | 1      | 2017-04-10 14:06:00 | 60    |
17 | 1      | 2017-04-10 14:07:20 | 654   |
18 | 1      | 2017-04-10 14:07:40 | 32    |
19 | 1      | 2017-04-10 14:08:00 | 33    |
20 | 1      | 2017-04-10 14:08:12 | 32    |
21 | 1      | 2017-04-10 14:10:00 | 8     |
\.
COPY 21
Time: 7684.259 ms
t=# alter table points rename column "UserId" to "Userid";
ALTER TABLE
Time: 1.013 ms

Frankly, I don't fully understand the request. This is how I read it from the description, and the results differ from what the OP expected:

t=# with r as (
  with g as (
    select generate_series('2017-04-10 14:00:00','2017-04-10 14:10:00','1 minute'::interval) s
  )
  select *,abs(extract(epoch from '2017-04-10 14:02:00' - "Time"))
  from g
  join points on g.s = date_trunc('minute',"Time")
  order by abs
  limit 11
)
select i, "Time","Value",abs
from r
order by i;
 i  |        Time         | Value | abs
----+---------------------+-------+-----
  4 | 2017-04-10 14:00:35 |    80 |  85
  5 | 2017-04-10 14:00:58 |   101 |  62
  6 | 2017-04-10 14:01:00 |   203 |  60
  7 | 2017-04-10 14:01:30 |   204 |  30
  8 | 2017-04-10 14:01:40 |   205 |  20
  9 | 2017-04-10 14:02:02 |    32 |   2
 10 | 2017-04-10 14:02:15 |     7 |  15
 11 | 2017-04-10 14:02:30 |   900 |  30
 12 | 2017-04-10 14:02:45 |    22 |  45
 13 | 2017-04-10 14:03:00 |    34 |  60
 14 | 2017-04-10 14:03:30 |    54 |  90
(11 rows)

I added the abs column (seconds from 14:02:00) to justify why I thought those rows fit the request better.

Vao Tsun