1

I have a PostgreSQL table in which some of the rows are dirty because at point of data entry the gender of the record is "unknown" although it always stays the same indicated by its object_id, which is the unique identifier of the object.

object_id                           gender
511E1AC7128EE2E74349896B55461F27    Unknown
511E1AC7128EE2E74349896B55461F27    Unknown
511E1AC7128EE2E74349896B55461F27    Male
56665EA256AB3757CDFA6C1CB4334C91    Female
56665EA256AB3757CDFA6C1CB4334C91    Female
56665EA256AB3757CDFA6C1CB4334C91    Unknown
56665EA256AB3757CDFA6C1CB4334C91    Unknown
55C3BFDBD327396E912604D6E635D59B    Unknown
55C3BFDBD327396E912604D6E635D59B    Unknown
55C3BFDBD327396E912604D6E635D59B    Female

Therefore, I want to update my table to be like this:

object_id                           gender
511E1AC7128EE2E74349896B55461F27    Male
511E1AC7128EE2E74349896B55461F27    Male
511E1AC7128EE2E74349896B55461F27    Male
56665EA256AB3757CDFA6C1CB4334C91    Female
56665EA256AB3757CDFA6C1CB4334C91    Female
56665EA256AB3757CDFA6C1CB4334C91    Female
56665EA256AB3757CDFA6C1CB4334C91    Female
55C3BFDBD327396E912604D6E635D59B    Female
55C3BFDBD327396E912604D6E635D59B    Female
55C3BFDBD327396E912604D6E635D59B    Female

In which all the "unknown" columns are converted to whatever non-unknown value the object_id has in /some other entry/. Is there a way I can do this in PostgreSQL - or - even better - Sequelize (Javascript ORM for SQL databases?)

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
user3264659
  • 341
  • 2
  • 7
  • 20
  • If such a thing is done it would be a patchwork solution at best because your data appears not be normalized. If anyone was to give you a patch work solution he would still need to know what object_state looks like – e4c5 Jul 12 '16 at 01:55
  • The usual suspects are missing: table definition and Postgres version. – Erwin Brandstetter Jul 12 '16 at 02:07
  • I took the liberty to remove the noise around `object_state`, which was orthogonal to the question and only served to distract. – Erwin Brandstetter Jul 12 '16 at 02:36
  • It's normalized. Just the other columns of `object_state` are irrelevant to the question, and I wanted to assert that there are more columns that can have varied information that explains why there are multiple entries with the same UUID. It isn't really a primary key of this table (although it is a primary key of another lookup table). – user3264659 Jul 12 '16 at 18:20

1 Answers1

0

The question is incomplete so I am assuming current Postgres version 9.5 and this table definition:

CREATE TABLE object (
   object_id uuid PRIMARY KEY  -- ideally a UUID
 , gender    text              -- could probably be boolean
 -- rest irrelevant
   );

Then the solution can be:

UPDATE object o
SET    gender = sub.gender
FROM (
   SELECT object_id, min(gender) AS gender
   FROM   object
   GROUP  BY 1
   ) sub
WHERE  o.object_id = sub.object_id
AND    o.gender IS DISTINCT FROM sub.gender;

You need the subquery because aggregate or window functions are not allowed in UPDATE directly.

The aggregate function min() happens to work because the text 'Unknown' sorts after 'Female' and 'Male'. It would also work for boolean (where 'Unknown' would be NULL), and min() and max() ignore NULL values.

The last WHERE condition is optional, but it's wise to include it to avoid empty updates. Can be simplified to o.gender <> sub.gender if the column is defined NOT NULL;

Why would it be ideal to use the data type uuid?

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • thanks. to add on to this, I think it would be good to switch "unknown" to NULL, so then I would change the second line to: `SET gender = NULLIF(sub.gender, 'Unknown')` ? – user3264659 Jul 12 '16 at 18:46
  • @user3264659: Yes, and drop the last line, since you want to update *all* rows in this case. – Erwin Brandstetter Jul 12 '16 at 22:46