7

I have imported tables from Postgres to hdfs by using sqoop. My table have uuid field as primary key and my command sqoop as below:

sqoop import --connect 'jdbc:postgresql://localhost:5432/mydb' --username postgreuser --password 123456abcA --driver org.postgresql.Driver --table users --map-column-java id=String --target-dir /hdfs/postgre/users --as-avrodatafile --compress -m 2

But I got the error:

Import failed: java.io.IOException: org.postgresql.util.PSQLException: ERROR: function min(uuid) does not exist

I tried executed the sql command: SELECT min(id) from users and got the same error. How could I fix it ? I use Postgres 9.4, hadoop 2.9.0 and sqoop 1.4.7

hazzy
  • 167
  • 1
  • 2
  • 9
  • 4
    You can cast the `uuid` column to text: e.g.: `min(id::text)` –  Mar 12 '18 at 06:59
  • Thanks @a_horse_with_no_name for answer, but how could i cast the uuid in sqoop. I try --map-column-java id=String , but didn't work. – hazzy Mar 12 '18 at 09:30

5 Answers5

5

I'd like to credit @robin-salih 's answer, I've used it and implementation of min for int, to build following code:

CREATE OR REPLACE FUNCTION min(uuid, uuid)
RETURNS uuid AS $$
BEGIN
    IF $2 IS NULL OR $1 > $2 THEN
        RETURN $2;
    END IF;

    RETURN $1;
END;
$$ LANGUAGE plpgsql;


create aggregate min(uuid) (
  sfunc = min,
  stype = uuid,
  combinefunc = min,
  parallel = safe,
  sortop = operator (<)
);

It almost the same, but takes advantages of B-tree index, so select min(id) from tbl works in few millis.

P.S. I'm not pgsql expert, perhaps my code is somehow wrong, double check before use in production, but I hope it uses indexes and parallel execution correctly. I've made it just from sample code, not digging into theory behind aggregates in PG.

Bogdan Mart
  • 460
  • 8
  • 19
2

Postgres doesn't have built-in function for min/max uuid, but you can create your own using the following code:

CREATE OR REPLACE FUNCTION min(uuid, uuid)
RETURNS uuid AS $$
BEGIN
    IF $2 IS NULL OR $1 > $2 THEN
        RETURN $2;
    END IF;

    RETURN $1;
END;
$$ LANGUAGE plpgsql;


CREATE AGGREGATE min(uuid)
(
    sfunc = min,
    stype = uuid
);
Robin Salih
  • 529
  • 1
  • 6
  • 21
2

I found the answer's provided by @robin-salih and @bodgan-mart to be a great starting point but ultimately incorrect. Here's a solution which worked better for me:

    CREATE FUNCTION min_uuid(uuid, uuid)
    RETURNS uuid AS $$
    BEGIN
        -- if they're both null, return null
        IF $2 IS NULL AND $1 IS NULL THEN
            RETURN NULL ;
        END IF;

        -- if just 1 is null, return the other
        IF $2 IS NULL THEN
            RETURN $1;
        END IF ;
        IF $1 IS NULL THEN
            RETURN $2;
          END IF;

        -- neither are null, return the smaller one
        IF $1 > $2 THEN
            RETURN $2;
        END IF;

        RETURN $1;
    END;
    $$ LANGUAGE plpgsql;


    create aggregate min(uuid) (
      sfunc = min_uuid,
      stype = uuid,
      combinefunc = min_uuid,
      parallel = safe,
      sortop = operator (<)
    );

For more details, see my post at How to select minimum UUID with left outer join?

Paymahn Moghadasian
  • 9,301
  • 13
  • 56
  • 94
1

I am defining min/max aggregates for uuids using least/greatest which I believe should give the best performance as those are native to postgres (but I haven't benchmarked it).

Since least/greatest are special forms (to my understanding) I have to proxy them using a function which I am marking as immutable and parallel safe.

least/greatest already have proper null-handling behavior.

I am using these in production on Postgres 13.

create or replace function min(uuid, uuid)
    returns uuid
    immutable parallel safe
    language plpgsql as
$$
begin
    return least($1, $2);
end
$$;

create aggregate min(uuid) (
    sfunc = min,
    stype = uuid,
    combinefunc = min,
    parallel = safe,
    sortop = operator (<)
    );

create or replace function max(uuid, uuid)
    returns uuid
    immutable parallel safe
    language plpgsql as
$$
begin
    return greatest($1, $2);
end
$$;

create aggregate max(uuid) (
    sfunc = max,
    stype = uuid,
    combinefunc = max,
    parallel = safe,
    sortop = operator (>)
    );
scravy
  • 11,904
  • 14
  • 72
  • 127
-3

This is not a issue with sqoop. Postgres doesn't allow min/max on uuid. Each uuid is unique and is not considered bigger/smaller than other.

To fix this in sqoop you might need to use some other field as the split-by key. I used created_At timestamp as my split-by key instead.

Sagar D
  • 188
  • 1
  • 12