
I have a bit of an "upsert" type of question... but I want to throw it out there because it's a little bit different from any that I've read on Stack Overflow.

Basic problem.

I'm working on moving from MySQL to PostgreSQL 9.1.5 (hosted on Heroku). As a part of that, I need to import multiple CSV files every day. Some of the data is sales information and is almost guaranteed to be new and needs to be inserted. But other parts of the data are almost guaranteed to be the same. For example, the CSV files (note plural) will have POS (point of sale) information in them. This rarely changes (and is most likely only via additions). Then there is product information. There are about 10,000 products (the vast majority will be unchanged, but it's possible to have both additions and updates).

The final item (but an important one) is that I have a requirement to be able to provide an audit trail/information for any given item. For example, if I add a new POS record, I need to be able to trace that back to the file it was found in. If I change a UPC code or description of a product, then I need to be able to trace it back to the import (and file) where the change came from.

Solution that I'm contemplating.

Since the data is provided to me via CSV, I'm working around the idea that COPY will be the best/fastest way. The structure of the data in the files is not exactly what I have in the database (i.e. final destination), so I'm copying them into tables in a staging schema that match the CSV (note: one schema per data source). The tables in the staging schemas will have before-insert row triggers. These triggers can decide what to do with the data (insert, update or ignore).

For the tables that are most likely to contain new data, the trigger will try to insert first. If the record is already there, it will return NULL (and stop the insert into the staging table). For tables that rarely change, it will query the table and see if the record is found. If it is, then I need a way to see if any of the fields have changed (because remember, I need to show that the record was modified by import x from file y). I can obviously just boilerplate out the code and test each column, but I was looking for something a little more elegant and more maintainable than that.
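To make that concrete, here is a minimal sketch of what such a before-insert trigger could look like, using a hypothetical staging.products table feeding public.products (all table and column names are made up for illustration, and the audit bookkeeping is left out):

CREATE OR REPLACE FUNCTION staging.products_import()
  RETURNS trigger AS
$func$
BEGIN
   -- try to update an existing product, but only if something actually changed
   UPDATE public.products p
   SET    upc = NEW.upc
        , description = NEW.description
   WHERE  p.product_id = NEW.product_id
   AND   (p.upc, p.description) IS DISTINCT FROM (NEW.upc, NEW.description);

   IF NOT FOUND THEN
      -- row is either brand new or identical; insert only if it does not exist yet
      INSERT INTO public.products (product_id, upc, description)
      SELECT NEW.product_id, NEW.upc, NEW.description
      WHERE  NOT EXISTS (
         SELECT 1 FROM public.products WHERE product_id = NEW.product_id
      );
   END IF;

   RETURN NULL;  -- discard the row so nothing is stored in the staging table
END
$func$ LANGUAGE plpgsql;

CREATE TRIGGER products_import
BEFORE INSERT ON staging.products
FOR EACH ROW EXECUTE PROCEDURE staging.products_import();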

In a way, what I'm kind of doing is combining an importing system with an audit trail system. So, in researching audit trails, I reviewed the following wiki.postgresql.org article. It seems like hstore might be a nice way of getting changes (and being able to easily ignore some columns in the table that aren't important - e.g. "last_modified").
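For example, here is a rough sketch of how hstore can surface just the changed columns while ignoring "last_modified" (the table names are placeholders; the hstore extension has to be installed first):

CREATE EXTENSION hstore;  -- once per database

SELECT n.product_id
     , (hstore(n.*) - hstore(o.*)) - 'last_modified'::text AS changed_columns
FROM   products_current o
JOIN   products_staged  n USING (product_id)
WHERE  (hstore(n.*) - hstore(o.*)) - 'last_modified'::text <> ''::hstore;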

I'm about 90% sure it will all work... I've created some testing tables etc and played around with it.

My question?

Is there a better, more maintainable way of accomplishing this task of finding the maybe 3 records out of 10K that require a change to the database? I could certainly write a Python script (or something else) that reads the file and tries to figure out what to do with each record, but that feels horribly inefficient and will lead to lots of round trips.

A few final things:

  1. I don't have control over the input files. I would love it if they only sent me the deltas, but they don't and it's completely outside of my control or influence.
  2. The system is growing and new data sources are likely to be added that will greatly increase the amount of data being processed (so I'm trying to keep things efficient).
  3. I know this is not a nice, simple SO question (like "how to sort a list in python"), but I believe one of the great things about SO is that you can ask hard questions and people will share their thoughts on the best way to solve them.
  • Two (final) questions: 1) do you have deletes, or is the input "incremental"? 2) can the suppliers of the data guarantee *stable keys* (no key updates)? – wildplasser Sep 20 '12 at 23:19
  • It varies with the data source and type of data. It's definitely a situation where I should probably be on the "defensive" when handling the data and be prepared for just about anything. That said, I think I could have deletes (but, rare) and I believe the keys *should* be stable (in other words, the ID for the POS record should remain the same between uploads). – David S Sep 21 '12 at 14:26

1 Answer


I have lots of similar operations. What I do is COPY to temporary staging tables:

CREATE TEMP TABLE target_tmp AS
SELECT * FROM target_tbl LIMIT 0;  -- only copy structure, no data

COPY target_tmp FROM '/path/to/target.csv';

For performance, run ANALYZE - temp. tables are not analyzed by autovacuum!

ANALYZE target_tmp; 

Also for performance, maybe even create an index or two on the temp table, or add a primary key if the data allows for that.

ALTER TABLE target_tmp ADD CONSTRAINT target_tmp_pkey PRIMARY KEY (target_id);

You don't need the performance stuff for small imports.

Then use the full scope of SQL commands to digest the new data.
For instance, if the primary key of the target table is target_id ..

Maybe DELETE what isn't there any more?

DELETE FROM target_tbl t
WHERE NOT EXISTS (
   SELECT 1 FROM target_tmp t1
   WHERE  t1.target_id = t.target_id
);

Then UPDATE what's already there:

UPDATE target_tbl t
SET    col1 = t1.col1
FROM   target_tmp t1
WHERE  t.target_id = t1.target_id

To avoid empty UPDATEs, simply add:

...
AND    col1 IS DISTINCT FROM t1.col1; -- repeat for relevant columns

Or, if the whole row is relevant:

...
AND    t IS DISTINCT FROM t1;         -- check the whole row

Then INSERT what's new:

INSERT INTO target_tbl(target_id, col1)
SELECT t1.target_id, t1.col1
FROM   target_tmp t1
LEFT   JOIN target_tbl t USING (target_id)
WHERE  t.target_id IS NULL;

Clean up if your session goes on (temp tables are dropped at end of session automatically):

DROP TABLE target_tmp;

Or use ON COMMIT DROP or similar with CREATE TEMP TABLE.
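A sketch of that variant, assuming the whole import runs inside a single transaction (the temp table is then dropped automatically at COMMIT):

CREATE TEMP TABLE target_tmp ON COMMIT DROP AS
SELECT * FROM target_tbl LIMIT 0;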
Code untested, but should work in any modern version of PostgreSQL except for typos.

  • Thanks for the reply. Good example/detail; I'm sure this will help some people. And it's very close to what I'm needing to do. The one requirement that's getting me is keeping track of which import "batch id" actually changed the record. In other words, I don't want to always write over/refresh the record if it exists - only update if there are changes. I suppose I could, and then just have an update trigger on the table and figure out there if the record truly changed... if so, then add an audit trail record. Seem reasonable? Got a better way? – David S Sep 19 '12 at 21:56
  • @DavidS: This may be simpler than you expect. I added a bit to the UPDATE section. It's almost always a good idea to exclude empty updates anyway. If you need an audit trail, I'd suggest you copy outdated versions (plus timestamp) to a history table before you `DELETE` / `UPDATE`. – Erwin Brandstetter Sep 19 '12 at 22:00
  • good suggestions and thanks for updating the answer. For auditing purposes, do you prefer to copy to an archive table (a full copy of the data + timestamp) over just using hstore(x.*) and storing it in a text field in an audit_trail/history type table? The main advantage of the archive table is that it's easily queried. The main advantage of the hstore approach seems to be flexibility if your schema changes down the road. I know it's a bit off topic and could probably be a question of its own, but I'm curious about your thoughts. Thanks! – David S Sep 19 '12 at 22:24
  • @DavidS: Well, it is a bit off topic and really depends on your requirements. My first impulse would be to go with a copy of the table + timestamp, because it's simple and you can use all the same queries for archive and live table. – Erwin Brandstetter Sep 19 '12 at 22:28
  • Avoiding the updates if nothing would change will avoid a lot of unnecessary writes (+ dirty rows + triggers, + vacuum). Timestamps + history are the first choice. I don't think that timestamps mix well with the "is distinct from ..." construct. – wildplasser Sep 19 '12 at 22:34
  • @wildplasser - agreed about the timestamps. But I think you just have to include each column that is relevant (as I think Erwin was trying to point out). I'm still sort of intrigued with a trigger-based system for the updates because it seems like you could almost handle it with a generic solution and just use TG_ARGVs for each table for the columns you want to ignore. In other words, (hstore(NEW.*) - hstore(OLD.*)) - TG_ARGV[0]... if no differences, then return null. – David S Sep 19 '12 at 22:43
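For reference, a minimal sketch of the generic trigger idea from the last comment (the audit_trail table and all other names are hypothetical; TG_ARGV[0] is assumed to hold an array literal of the columns to ignore, and the hstore extension must be installed):

CREATE TABLE audit_trail (
   tbl            text,
   old_row        hstore,
   changed_fields hstore,
   changed_at     timestamptz DEFAULT now()
);

CREATE OR REPLACE FUNCTION audit_if_changed()
  RETURNS trigger AS
$func$
DECLARE
   _ignored text[] := TG_ARGV[0]::text[];   -- e.g. '{last_modified}'
   _changed hstore;
BEGIN
   _changed := (hstore(NEW.*) - hstore(OLD.*)) - _ignored;

   IF _changed = ''::hstore THEN
      RETURN NULL;      -- nothing relevant changed: skip the empty UPDATE
   END IF;

   INSERT INTO audit_trail (tbl, old_row, changed_fields)
   VALUES (TG_TABLE_NAME, hstore(OLD.*), _changed);

   RETURN NEW;
END
$func$ LANGUAGE plpgsql;

CREATE TRIGGER products_audit
BEFORE UPDATE ON products
FOR EACH ROW EXECUTE PROCEDURE audit_if_changed('{last_modified}');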