
I have a large database (6,000 rows are inserted per minute) with a partitioned table. Everything worked fine while the database was small, but now it is large. I was previously using a solution based on SQL joined by date, but it used 250 MB of disk and grew as the table grew, so I changed it to a loop of simple queries. That works well for 10 rows, but gets slow with more than 10 cars (15 seconds per response) and uses more than 200 MB of disk.

My question is: how can I build a faster query to resolve this issue?

Extra info

  1. The query is called from a Django app via AJAX
  2. I was considering iterative AJAX calls instead of one call returning the full item list
  3. Tables are partitioned by day

My current query is:

CREATE OR REPLACE FUNCTION gps_get_last_positions (
    _plates varchar(8)
)
RETURNS TABLE (
    plate varchar,
    device_id integer,
    date_time_process timestamp with time zone, 
    latitude double precision, 
    longitude double precision, 
    course smallint, 
    speed smallint, 
    mileage integer,
    gps_signal smallint,
    gsm_signal smallint,
    alarm_status boolean,
    gsm_status boolean,
    vehicle_status boolean,
    alarm_over_speed boolean,
    other text,
    realtime  double precision
) AS $func$
DECLARE 
    arr varchar[];

BEGIN
    arr := regexp_split_to_array(_plates, E'\\s+');
    -- one SELECT (and one full ORDER BY ... LIMIT 1) per plate
    FOR i IN 1..array_length(arr, 1) LOOP
        RETURN QUERY SELECT 
        gpstracking_vehicles.registration,
        gpstracking_device_tracks.device_id, 
        gpstracking_device_tracks.date_time_process,
        gpstracking_device_tracks.latitude,
        gpstracking_device_tracks.longitude,
        gpstracking_device_tracks.course,
        gpstracking_device_tracks.speed,
        gpstracking_device_tracks.mileage,
        gpstracking_device_tracks.gps_signal,
        gpstracking_device_tracks.gsm_signal,
        gpstracking_device_tracks.alarm_status,
        gpstracking_device_tracks.gps_status,
        gpstracking_device_tracks.vehicle_status,
        gpstracking_device_tracks.alarm_over_speed,
        gpstracking_device_tracks.other,
        EXTRACT(EPOCH FROM current_timestamp - gpstracking_device_tracks.date_time_process)/60 AS realtime
        FROM (
        gpstracking_devices INNER JOIN (
        gpstracking_vehicles INNER JOIN gpstracking_vehicles_devices ON ( gpstracking_vehicles.id = gpstracking_vehicles_devices.vehicle_id AND gpstracking_vehicles_devices.is_joined = TRUE )
        ) ON ( gpstracking_devices.id = gpstracking_vehicles_devices.device_id AND gpstracking_vehicles_devices.is_joined = TRUE )
        ) INNER JOIN gpstracking_device_tracks ON ( gpstracking_devices.id = gpstracking_device_tracks.device_id )
        WHERE gpstracking_vehicles.registration = arr[i]::VARCHAR
        ORDER BY gpstracking_device_tracks.date_time_process DESC
        LIMIT 1;
    END LOOP;
    RETURN;
END;
$func$ 
LANGUAGE plpgsql VOLATILE SECURITY DEFINER;
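For comparison, the same logic can be written as one set-based statement instead of a loop. This is only a sketch against the same tables (the full column list is abbreviated): `DISTINCT ON` keeps the newest track row per registration plate, and it works on PostgreSQL 9.1:

```sql
-- Sketch: the loop above as a single statement.
-- DISTINCT ON (v.registration) keeps only the first row per plate,
-- which the ORDER BY makes the newest one.
SELECT DISTINCT ON (v.registration)
       v.registration,
       t.device_id,
       t.date_time_process,
       t.latitude,
       t.longitude,
       EXTRACT(EPOCH FROM current_timestamp - t.date_time_process)/60 AS realtime
FROM gpstracking_vehicles v
JOIN gpstracking_vehicles_devices vd
  ON vd.vehicle_id = v.id AND vd.is_joined
JOIN gpstracking_device_tracks t
  ON t.device_id = vd.device_id
WHERE v.registration = ANY (regexp_split_to_array(_plates, '\s+'))
ORDER BY v.registration, t.date_time_process DESC;
```

Note that, like the loop, this still scans all partitions unless the partitioning key is constrained.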

Configuration params

application_name            phpPgAdmin_5.0.4                    client
constraint_exclusion        on                                  configuration file
DateStyle                   ISO, MDY                            session
default_text_search_config  pg_catalog.english                  configuration file
external_pid_file           /var/run/postgresql/9.1-main.pid    configuration file
lc_messages                 en_US.UTF-8                         configuration file
lc_monetary                 en_US.UTF-8                         configuration file
lc_numeric                  en_US.UTF-8                         configuration file
lc_time                     en_US.UTF-8                         configuration file
log_line_prefix             %t                                  configuration file
log_timezone                localtime                           environment variable
max_connections             100                                 configuration file
max_stack_depth             2MB                                 environment variable
port                        5432                                configuration file
shared_buffers              24MB                                configuration file
ssl                         on                                  configuration file
TimeZone                    localtime                           environment variable
unix_socket_directory       /var/run/postgresql                 configuration file

My first slow query was:

CREATE OR REPLACE VIEW view_vehicle_devices AS
SELECT 
gpstracking_vehicles_devices.id AS id,
gpstracking_devices.id AS device_id,
gpstracking_vehicles.id AS vehicle_id,
gpstracking_drivers.id AS driver_id,
gpstracking_device_protocols.name AS protocol,
gpstracking_vehicles.registration AS plate,
gpstracking_drivers.firstname as first_name,
gpstracking_drivers.lastname as last_name,
gpstracking_devices.imei,
gpstracking_devices.simcard, 
gpstracking_device_tracks.date_time_process,
gpstracking_device_tracks.latitude,
gpstracking_device_tracks.longitude,
gpstracking_device_tracks.course,
gpstracking_device_tracks.speed,
gpstracking_device_tracks.mileage,
gpstracking_device_tracks.gps_signal,
gpstracking_device_tracks.gsm_signal,
gpstracking_device_tracks.alarm_status,
gpstracking_device_tracks.gps_status,
gpstracking_device_tracks.vehicle_status,
gpstracking_device_tracks.alarm_over_speed,
gpstracking_device_tracks.other,
gpstracking_device_tracks.point,
EXTRACT(EPOCH FROM current_timestamp - gpstracking_device_tracks.date_time_process)/60 realtime,
gpstracking_devices.created,
gpstracking_devices.updated,
gpstracking_devices.is_connected

FROM (
gpstracking_vehicles LEFT JOIN (
gpstracking_drivers  LEFT JOIN gpstracking_vehicles_drivers ON gpstracking_drivers.id = gpstracking_vehicles_drivers.driver_id AND gpstracking_vehicles_drivers.is_joined = TRUE
) ON gpstracking_vehicles.id = gpstracking_vehicles_drivers.vehicle_id AND gpstracking_vehicles_drivers.is_joined = TRUE
) LEFT JOIN (((
gpstracking_device_protocols RIGHT JOIN gpstracking_devices ON gpstracking_device_protocols.id = gpstracking_devices.device_protocol_id
) LEFT JOIN (
SELECT DISTINCT ON (gpstracking_device_tracks.device_id) gpstracking_device_tracks.device_id, 
gpstracking_device_tracks.date_time_process,
gpstracking_device_tracks.latitude,
gpstracking_device_tracks.longitude,
gpstracking_device_tracks.course,
gpstracking_device_tracks.speed,
gpstracking_device_tracks.mileage,
gpstracking_device_tracks.gps_signal,
gpstracking_device_tracks.gsm_signal,
gpstracking_device_tracks.alarm_status,
gpstracking_device_tracks.gps_status,
gpstracking_device_tracks.vehicle_status,
gpstracking_device_tracks.alarm_over_speed,
gpstracking_device_tracks.other,
gpstracking_device_tracks.point
FROM gpstracking_device_tracks 
ORDER BY gpstracking_device_tracks.device_id, gpstracking_device_tracks.date_time_process DESC
) AS gpstracking_device_tracks ON gpstracking_devices.id = gpstracking_device_tracks.device_id  
) LEFT JOIN gpstracking_vehicles_devices ON ( gpstracking_devices.id = gpstracking_vehicles_devices.device_id AND gpstracking_vehicles_devices.is_joined = TRUE )
) ON ( gpstracking_vehicles.id = gpstracking_vehicles_devices.vehicle_id AND gpstracking_vehicles_devices.is_joined = TRUE )

I have replaced it with the loop that opens this post. The loop runs faster, but it is still not as fast as I need.

  • Do you have an index on gpstracking_device_tracks.date_time_process? – Tometzky Aug 31 '13 at 09:55
  • Yes, I have it, and it is on each child table of the partitioned gpstracking_device_tracks – Carlos Aug 31 '13 at 10:21
  • Can you post some details? Postgresql version, table structures including indexes, the results of `SELECT name, current_setting(name), source FROM pg_settings WHERE source NOT IN ('default', 'override')`. – bma Aug 31 '13 at 15:44
  • Can you post the `EXPLAIN (ANALYZE,BUFFERS)` of the slow query as well as the table/index DDL? – bma Aug 31 '13 at 16:17
  • Your `shared_buffers=24MB` is way too small for today's computers. I'd set it to about 25% of available RAM on a dedicated machine. – Tometzky Aug 31 '13 at 17:49
  • @Tometzky I have made the suggested change with no difference in the result – Carlos Sep 01 '13 at 12:38
  • @bma the EXPLAIN for my first slow query is at http://explain.depesz.com/s/8CBa and the paste is in the post – Carlos Sep 01 '13 at 12:41
  • What are the specifications of the server? Your server is hitting the disk because it's exhausting the available memory. – Burhan Khalid Sep 01 '13 at 12:47
  • As @BurhanKhalid says, your sort step is spilling to disk. That implies that your `work_mem` is too low; for this particular query it would need to be set to around 400MB to avoid spilling to disk. How much RAM is in the system? Also, your query does not appear to be pruning partitions (because you are not using the partitioning key in the query). One suggestion for your plpgsql function is to dynamically construct the query and hit the partitions directly, based on whatever inputs you have, thus avoiding hitting all of the partitions. – bma Sep 01 '13 at 15:02
  • Do you have an index on `gpstracking_device_tracks.device_id, date_time_process`? Also, too many* partitions is known to slow down queries and query planning, so it might behoove you to roll up some of those older partitions into monthly partitions (depending on your access and retention patterns). *Too many can be 100 on underpowered hardware, or 20000 on a more powerful machine; it is pretty subjective. – bma Sep 01 '13 at 15:09
  • My server has 4 GB of RAM. I am reading some of the material you mentioned, and to be honest I am not very experienced with this; for example, I did not know about work_mem. I am reading about it now; perhaps you have a URL where I can learn more about it. – Carlos Sep 01 '13 at 21:33
  • Also, I was thinking of making something like a current-day table plus a historic one, to avoid partitions, or to partition other tables instead of the mother table. The process is not clear to me, but the idea is not to have too many tables involved in the current day's work: just one or two, or 7 for the last days, with older information going to another table called historic or something like that. Any suggestions? – Carlos Sep 01 '13 at 21:35
    `work_mem` docs: http://www.postgresql.org/docs/current/static/runtime-config-resource.html#GUC-WORK-MEM. Try this: execute "`SET work_mem TO 400MB;`" then run your query again, you should see a significant speedup. That demonstrates what setting `work_mem` can do. In a server with only 4GB of RAM, you will not be able to leave it that high. Regarding partitioning strategies, a fairly common method: keep 7 or 10 daily partitions, and any data older goes into monthly partitions. Have a scheduled archive process move the data 8 (or 10) days old into the monthly partitions, running every day. – bma Sep 01 '13 at 22:18

1 Answer


Your problem is that the planner cannot know which partition contains the answer to your query; it only has statistics. So you do not benefit from partitioning your data by day at all.

To benefit from it you can modify your query so that it looks for the latest coordinates from the current day first; if none are found, then from yesterday; if still none, from the day before, and so on. I suppose 99% of answers would be found in today's partition alone.
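A sketch of that day-by-day fallback, assuming the partitions carry CHECK constraints on `date_time_process`: each attempt restricts the scan to one day's range so constraint exclusion can skip the other partitions. Note that on 9.1 the bounds generally need to be literal constants (e.g. built dynamically in plpgsql), since plan-time constraint exclusion cannot use expressions like `now()`:

```sql
-- Attempt 1: today's partition only (literal example dates);
-- if this returns nothing for a device, repeat with the previous day.
SELECT DISTINCT ON (device_id) *
FROM gpstracking_device_tracks
WHERE date_time_process >= '2013-09-01'   -- start of the target day
  AND date_time_process <  '2013-09-02'   -- start of the next day
ORDER BY device_id, date_time_process DESC;
```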

Or you can partition by, for example, device_id % 256 instead.
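With 9.1-style inheritance partitioning, that would look roughly like the following (hypothetical child-table names). The planner can only exclude children when the query repeats the same modulo predicate:

```sql
-- One child table per hash bucket (sketch; likewise p1 through p255).
CREATE TABLE gpstracking_device_tracks_p0 (
    CHECK (device_id % 256 = 0)
) INHERITS (gpstracking_device_tracks);

-- Queries must include the matching predicate to get exclusion, e.g.:
-- SELECT ... FROM gpstracking_device_tracks
-- WHERE device_id = 1234 AND device_id % 256 = 1234 % 256;
```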

But even better would be to create an additional table holding only a few recent coordinates per device. It would be maintained by a trigger on gpstracking_device_tracks that does roughly the following (a sketch; the insert column list is elided):

CREATE OR REPLACE FUNCTION gpstracking_device_tracks_recent_maintain()
RETURNS trigger AS $$
BEGIN
    IF random() * 64 < 1.0 THEN
        -- statistically once per 64 runs do a cleanup
        WITH todelete AS (
            -- lock rows in a particular order to avoid possible
            -- deadlocks if run concurrently
            SELECT id FROM gpstracking_device_tracks_recent
            WHERE device_id = NEW.device_id
            ORDER BY id
            FOR SHARE
        )
        DELETE FROM gpstracking_device_tracks_recent
        WHERE id IN (SELECT id FROM todelete);
    END IF;
    INSERT INTO gpstracking_device_tracks_recent (...) VALUES (...);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

And then look for the latest coordinates in this much smaller table.
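Reading from the small table is then cheap. A sketch, assuming gpstracking_device_tracks_recent mirrors the track columns:

```sql
-- Latest fix per device from the small "recent" table only.
SELECT DISTINCT ON (device_id)
       device_id, date_time_process, latitude, longitude
FROM gpstracking_device_tracks_recent
ORDER BY device_id, date_time_process DESC;
```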

Tometzky
  • I have checked all the info provided and my database now runs fast! I just needed to build the right query with the right configuration. Thanks to all – Carlos Sep 02 '13 at 12:24