0

About 8 months ago I used a suggestion to set up a holding table, then push to the formal table and prevent duplicate entries, per this post: Best way to prevent duplicate data on copy csv postgresql

It's been working very nicely, but today I noticed some errors and gaps in the data.

Here's my insert statement: insert statement

And here's how the index is set up: indexes

And here's an example of the error I'm getting, although on the next chron insert, it went through.error

Here's it going through fine: success

I haven't noted any large changes in the data incoming. Here's what the data looks like that's coming in now: data looks like In summary, I've noticed recent oddities with the insert statements, and success is erratic, resulting in large data gaps in the database. Thanks for any help, and I'm happy to provide more details, but I wanted to see if my information sounds like something someone else has already dealt with.

Thanks very much for any help, S

Community
  • 1
  • 1
user1610717
  • 471
  • 5
  • 16
  • Posting screen shots of text isn't a great idea, why not post the text itself? – mu is too short Apr 12 '16 at 02:30
  • I thought I may get asked that. Because I'm using a remote desktop connection for the db, and it's lame at best when trying to copy/paste. Plus, I thought it was more of a conceptual question versus readers trying to recreate the code and test it. – user1610717 Apr 12 '16 at 02:39
  • Please check if column `"time"` has `null` values. `null` values break `not in` comparisons – Ihor Romanchenko Apr 12 '16 at 08:30
  • I am making the column time "not null" as we speak. It's a big table so taking a while. Thanks for the suggestion! – user1610717 Apr 12 '16 at 22:27

1 Answers1

0

As Gordon pointed out in his answer to your previous question, this approach only works if you have exclusive access to the table. There is a delay between the existence check and the insert itself, and if another process modifies the table during this window, you may end up with duplicates.

If you're on Postgres 9.5+, the best approach is to skip the existence check and simply use an INSERT ... ON CONFLICT DO NOTHING statement.

On earlier versions, the simplest solution (if you can afford to do so) would be to lock the table for the duration of the import. Otherwise, you can emulate ON CONFLICT DO NOTHING (albeit less efficiently) using a loop and an exception handler:

DO $$
DECLARE r RECORD;
BEGIN
  FOR r IN (SELECT * FROM holding) LOOP
    BEGIN
      INSERT INTO ltg_data (pulsecount, intensity, time, lon, lat, ltg_geom)
      VALUES (r.pulsecount, r.intensity, r.time, r.lon, r.lat, r.ltg_geom);
    EXCEPTION WHEN unique_violation THEN
    END;
  END LOOP;
END
$$

As an aside, DELETE FROM holding is time-consuming and probably unnecessary. Create your staging table as TEMP, and it will be cleaned up automatically at the end of your session. You can easily build a table which matches the structure of your import target:

CREATE TEMP TABLE holding (LIKE ltg_data);
Community
  • 1
  • 1
Nick Barnes
  • 19,816
  • 3
  • 51
  • 63
  • Thanks so much for these ideas! I do have exclusive access to the table. It's only modified every 10 min when the insert script is run. I do access the table to view the data on occasion. So should I still try the "insert...on conflict do nothing" statement? And should the "on conflict" portion replace the "where" portion of my original insert statement? Could it be something else now that you know we have exclusive access? Also, it's a 7mil row table. Don't know if that size affects anything. – user1610717 Apr 12 '16 at 22:15
  • The fact that it works when you re-run it indicates that the state of the table has changed; I'm not sure how this could happen without another writer involved. How long does the import take? Any chance that the cron jobs are overlapping? If this isn't it, [statement logging](http://blog.endpoint.com/2012/06/logstatement-postgres-all-full-logging.html) might provide some insight. Re the `ON CONFLICT` clause, it's probably a better approach regardless of your current issues, and yes, it would take the place of your existing `WHERE` clause. – Nick Barnes Apr 13 '16 at 01:49