
I have been trying to find an optimal solution to select the unique values from each column. My problem is that I don't know the column names in advance, since different tables have different numbers of columns. So first I have to find the column names, which I can do with the query below:

select column_name from information_schema.columns
where table_name='m0301010000_ds' and column_name like 'c%' 

Sample output for column names:

c1, c2a, c2b, c2c, c2d, c2e, c2f, c2g, c2h, c2i, c2j, c2k, ...

Then I would use the returned column names to get the unique/distinct values in each column, not just distinct rows.

I know the simplest (and lousy) way is to write select distinct column_name from table where column_name = 'something' for every single column (around 20-50 times), and it's very time-consuming too. Since I can't use more than one distinct per column_name, I am stuck with this old-school solution.
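A minimal sketch of that naive approach, using the sample column names above, one statement (and round trip) per column:

select distinct c1  from m0301010000_ds;
select distinct c2a from m0301010000_ds;
select distinct c2b from m0301010000_ds;
-- ...and so on for each of the 20-50 columns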

I am sure there must be a faster and more elegant way to achieve this; I just can't figure out how. I will really appreciate any help on this.

Dev Patel
  • Since the number of rows is going to be different for every column, and unique column values won't go together any more, how do you want your result? Show an example output. Also, what data types are we talking about? – Erwin Brandstetter May 19 '14 at 19:31
  • I am afraid there is no "faster and elegant way to achieve this", but I am interested if there is one ;) – Cedric Simon May 19 '14 at 19:36
  • If you need to optimize performance, provide all the essentials in your question: example table definition (including indexes), cardinalities (how many rows?), how many writes / time unit - or read only? Define and describe the "condition" you mentioned in the title. How selective is it? What forms can it take? How many values in the result? What is the use case? Check the info for the tag I added: [postgresql-performance]. – Erwin Brandstetter May 21 '14 at 20:32

2 Answers


You can't just return rows, since distinct values don't go together any more.

You could return arrays, which can be built more simply than you might expect:

SELECT array_agg(DISTINCT c1)  AS c1_arr
      ,array_agg(DISTINCT c2a) AS c2a_arr
      ,array_agg(DISTINCT c2b) AS c2b_arr
      , ...
FROM   m0301010000_ds;

This returns distinct values per column. One array (possibly big) for each column. All connections between values in columns (what used to be in the same row) are lost in the output.
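With some hypothetical data, the output would look something like this (a single row, one array per column):

 c1_arr  | c2a_arr | c2b_arr
---------+---------+---------
 {1,2,3} | {a,b}   | {x,y}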

Build SQL automatically

CREATE OR REPLACE FUNCTION f_build_sql_for_dist_vals(_tbl regclass)
  RETURNS text AS
$func$
SELECT 'SELECT ' || string_agg(format('array_agg(DISTINCT %1$I) AS %1$I_arr'
                                     , attname)
                              , E'\n      ,' ORDER  BY attnum)
        || E'\nFROM   ' || _tbl
FROM   pg_attribute
WHERE  attrelid = _tbl            -- valid, visible table name 
AND    attnum >= 1                -- exclude tableoid & friends
AND    NOT attisdropped           -- exclude dropped columns
$func$  LANGUAGE sql;

Call:

SELECT f_build_sql_for_dist_vals('public.m0301010000_ds');

Returns an SQL string as displayed above.
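Executing the generated statement is a second step. A minimal sketch, assuming psql 9.6 or later where the \gexec meta-command is available (on older versions, run the returned string manually or via EXECUTE in a PL/pgSQL function):

SELECT f_build_sql_for_dist_vals('public.m0301010000_ds') \gexec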

I use the system catalog pg_attribute instead of the information schema, and the object identifier type regclass for the table name. More explanation in this related answer:
PLpgSQL function to find columns with only NULL values in a given table

Erwin Brandstetter
  • Thanks Erwin. I think returning it as an array/string will do the trick; I can work with it in my code, I believe. Actually, I knew it couldn't be a row-based solution, so I was looking for something similar to what you provided. Again, thanks a bunch!!! – Dev Patel May 19 '14 at 21:17
  • Thanks, but I am not convinced it is faster than the initial proposal... Actually, you have to do a select distinct for each field anyway, so this solution is more "expert" (= complicated to my eyes), but I am not sure it is more efficient. Please let me know if execution time is significantly faster with the above solution (or not). – Cedric Simon May 20 '14 at 00:21
  • @CedricSimon: I am not convinced that execution time will be much faster than individual `SELECT DISTINCT`, either - or if that's possible at all. There may be some gain for small to medium tables for saving overhead. I understood the question in terms of finding faster means to *write the query* for a large number of columns. As for complexity, it's simple enough and much less verbose. – Erwin Brandstetter May 20 '14 at 01:01
  • @CedricSimon I did a comparison of execution time, and it (returning results in arrays) took half the time (5 seconds vs. 9.5 seconds earlier). So faster, but 5 seconds is still too slow to use in a real-time environment. I am still looking for any faster option; I hope there will be one :). – Dev Patel May 21 '14 at 13:43
  • @Dev Patel & Erwin Brandstetter: Wow! I am impressed. I was not expecting a notable performance increase. Good surprise to me. Thanks. – Cedric Simon May 21 '14 at 14:37
  • @CedricSimon Yep, but I still wish to get it down to 1 or 2 seconds at most for a table of 65000+ rows and 40+ columns. Let's see if someone manages to find a more robust solution. – Dev Patel May 21 '14 at 15:09
  • @ErwinBrandstetter - For the time being, I am un-marking the accepted answer, so more people will look at it as an unsolved question. I hope you don't mind :). – Dev Patel May 21 '14 at 15:11
  • @DevPatel: I understand and wish you luck. (seriously doubt it, though.) – Erwin Brandstetter May 21 '14 at 15:13

If you need this in "real time", you won't be able to achieve it with a SQL query that needs a full table scan.

I would advise you to create a separate table containing the distinct values for each column (initialized with the SQL from @Erwin Brandstetter ;) and maintain it with a trigger on the original table.

Your new table will have one column per field. The number of rows will equal the maximum number of distinct values for any one field.

On insert: for each maintained field, check whether the value is already there. If not, add it.

On update: for each maintained field whose old value differs from the new value, check whether the new value is already there. If not, add it. As for the old value, check whether any other row still has it, and if not, remove it from the list (set the field to null).

On delete: for each maintained field, check whether any other row still has the value, and if not, remove it from the list (set the field to null).
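To make this concrete, here is a minimal sketch for a single maintained field (c1), simplified to one narrow side table per column rather than the wide table described above; all table, function and trigger names are hypothetical:

CREATE TABLE m0301010000_ds_dist_c1 (c1 text UNIQUE);

CREATE OR REPLACE FUNCTION trg_maintain_dist_c1()
  RETURNS trigger AS
$func$
BEGIN
   IF TG_OP IN ('INSERT', 'UPDATE') THEN
      IF NEW.c1 IS NOT NULL THEN
         -- add the new value if it is not listed yet
         -- (the UNIQUE constraint catches races between concurrent writers)
         INSERT INTO m0301010000_ds_dist_c1 (c1)
         SELECT NEW.c1
         WHERE  NOT EXISTS (SELECT 1 FROM m0301010000_ds_dist_c1 WHERE c1 = NEW.c1);
      END IF;
   END IF;

   IF TG_OP IN ('UPDATE', 'DELETE') THEN
      -- drop the old value if no row in the base table still has it
      DELETE FROM m0301010000_ds_dist_c1
      WHERE  c1 = OLD.c1
      AND    NOT EXISTS (SELECT 1 FROM m0301010000_ds WHERE c1 = OLD.c1);
   END IF;

   RETURN NULL;  -- AFTER trigger: return value is ignored
END
$func$ LANGUAGE plpgsql;

CREATE TRIGGER maintain_dist_c1
AFTER INSERT OR UPDATE OF c1 OR DELETE ON m0301010000_ds
FOR EACH ROW EXECUTE PROCEDURE trg_maintain_dist_c1();

Reading the values is then a trivial SELECT c1 FROM m0301010000_ds_dist_c1; that never touches the big table.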

This way the load mainly moves to the trigger, and the SQL on the value-list table will be super fast.

P.S.: Make sure to run all the SQL from the triggers through EXPLAIN, to make sure it uses the best possible indexes and execution plans. For the update/delete checks, you only need to know whether the old value still exists (LIMIT 1).

Cedric Simon
  • Thanks for the suggestion. The table(s) I am using have distinct rows with many combinations of column values (>30), and I have about 20+ such tables. So I am not sure what you mean by a table with distinct values. Did you mean one table for each column's distinct values? My main challenge is to retrieve information (SELECT) from pre-existing tables. – Dev Patel May 27 '14 at 12:54
  • I mean that one column will hold the distinct values for one field. One column per field. Then just select the non-null values from the desired column. – Cedric Simon May 27 '14 at 13:20
  • Yep, agreed. But I have to select values based on condition(s), and so, depending on the conditions, the results would be unique combinations of 1000s of rows. So I am still wondering how to keep each field's distinct values... I will try to provide a snippet/example of the data, which may explain it better. – Dev Patel May 27 '14 at 17:03
  • Mmmm... Go to the store, buy some GB of ram, and put all your tables in memory :D. LOL – Cedric Simon May 27 '14 at 17:42