I'm using a small collection of web scrapers to get the current GPS location of various devices. I also want to keep historical records. What's the best way of doing this without storing the data twice? For now I have two tables, both looking like this:

 Column  |            Type             |   Modifiers   | Storage  | Description
---------+-----------------------------+---------------+----------+-------------
 vehicle | character varying(20)       |               | extended |
 course  | real                        |               | plain    |
 speed   | real                        |               | plain    |
 fix     | smallint                    |               | plain    |
 lat     | real                        |               | plain    |
 lon     | real                        |               | plain    |
 time    | timestamp without time zone | default now() | plain    |

One is named gps, and the other is named gps_log. The function that updates them does two things: first it performs an INSERT on gps_log, and afterwards it does an UPDATE OR INSERT (a user-defined function) on gps. However, this seems to me like a pointless case of double storage that serves no purpose other than easy SELECT access to the current data.

Is there a simple way of using only gps_log and having a function select just the newest entry for each vehicle? Keep in mind that gps_log currently has 1,397,150 rows and grows by roughly 150 rows every 15 minutes, so performance is likely to be an issue.

Using PostgreSQL 8.4 via Perl DBI.

Erwin Brandstetter
Jarmund

1 Answer

If SELECT performance is paramount, your current solution with redundant storage might not be such a bad idea.

If you get rid of the redundant table, you can help SELECT performance with a multi-column index like:

CREATE INDEX gps_log_vehicle_time ON gps_log (vehicle, time DESC);

Assuming that vehicle is your primary key, this would make the corresponding query pretty fast:

SELECT *
FROM   gps_log
WHERE  vehicle = 'foo'
ORDER  BY time DESC
LIMIT  1;

To SELECT the last entry for multiple or all vehicles, use this related technique.
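
The link target is not preserved in this text, so the following is only a sketch of one common PostgreSQL approach, assuming the technique meant is DISTINCT ON (available in 8.4). It returns one row per vehicle, namely the one with the newest time, and its sort order matches the multi-column index above:

```sql
-- One row per vehicle: the row with the newest time.
-- DISTINCT ON is a PostgreSQL extension; the leading ORDER BY
-- expressions must match the DISTINCT ON expressions.
SELECT DISTINCT ON (vehicle) *
FROM   gps_log
ORDER  BY vehicle, time DESC;
```

With many rows per vehicle this still has to walk through most of the table or index, so whether it is fast enough depends on the data distribution.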

Total storage size would probably grow, though, because the index will be bigger than the redundant table (+ index) if you have many rows per vehicle.

It might help storage and performance to add a serial column as a surrogate primary key instead of vehicle, especially if you have foreign keys pointing to it.
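
One possible reading of the surrogate key suggestion, as a rough sketch: a lookup table holds each vehicle name once, and the log stores a 4-byte integer instead of a varchar(20). The names vehicle_id, gps_log2 and ts are hypothetical and not part of the original schema:

```sql
-- Hypothetical lookup table: one row per device.
CREATE TABLE vehicle (
    vehicle_id serial PRIMARY KEY,
    vehicle    varchar(20) NOT NULL UNIQUE
);

-- Hypothetical log table that references the integer key
-- instead of repeating the vehicle name in every row.
CREATE TABLE gps_log2 (
    vehicle_id integer   NOT NULL REFERENCES vehicle,
    course     real,
    speed      real,
    fix        smallint,
    lat        real,
    lon        real,
    ts         timestamp NOT NULL DEFAULT now()
);

-- Same index idea as above, on the narrower key.
CREATE INDEX gps_log2_vehicle_ts ON gps_log2 (vehicle_id, ts DESC);
```

The narrower key makes both the rows and the index entries smaller. How much that helps would have to be measured on the real data.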

Aside: don't use time as column name. It's a type name in PostgreSQL and a reserved word in every SQL standard. It is also misleading to name a timestamp column time.
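
If renaming is an option, a minimal sketch could look like this. The new name ts is just a placeholder, and any functions or queries that reference the old name would need to be updated as well:

```sql
-- Rename the column in both tables; existing indexes on the
-- column keep working, since they follow the rename.
ALTER TABLE gps_log RENAME COLUMN "time" TO ts;
ALTER TABLE gps     RENAME COLUMN "time" TO ts;
```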

Erwin Brandstetter
  • Could you please expand on this and provide an efficient way of getting the current state after this index is in place? – Jarmund Nov 28 '12 at 09:12
  • SELECTing a single vehicle always ran with acceptable performance. How would one go about SELECTing all vehicles with the most up-to-date entry for each, then? – Jarmund Nov 28 '12 at 09:18
  • Tested it, and unfortunately the SELECT does not meet the speed requirements. Accepting your answer, as it highlighted that my redundant storage wasn't as pointless as my gut told me, plus I'm sure your answer will help others in a similar situation. – Jarmund Nov 28 '12 at 09:31