Hide duplicate column cells in view

Question

First of all, this here doesn't solve my problem at all. I tried this too.

I want generate random unique fake data (first name and address)

I used the following SQL request:

CREATE OR REPLACE VIEW bah AS
SELECT DISTINCT A.VAL AS VORNAME,
    B.VAL AS ADRESSE
FROM ANON.FIRST_NAME A,
    ANON.ADDRESS B
    GROUP BY
    A.val,b.val
    HAVING COUNT(*) = 1;

The result apparently looked like this (the addresses repeat themselves for the next name too):

How do I make the first names AND addresses unique?

Please note this is in CSV format:

Expected result:

"Abbas","Dippelstr. 736, 23892 Hainichen",
"Gilda","Noackallee 6/2, 24711 Malchin",
"Guenter","Fredy-Junk-Gasse 3, 90438 Schmölln",
"Hans-Ulrich","Karl-Peter-Kambs-Gasse 996, 15869 Sulzbach-Rosenberg"

Sample Data:

Column1 (contains oids (which is irrelevant) and vals(relevant! hence listed below):
"Abbas", "Ante","Anthony"... 
Column2: (see above - oids and vals) 
"Benthinstraße 31, 35994 Kleve", "Cordula-Bachmann-Ring 4/8, 06292 Neustrelitz", "Danny-Fischer-Weg 8/9, 28346 Rastatt", "Eckbauergasse 157, 10570 Rudolstadt"

Do you really need to do `CROSS JOIN` by `ANON.FIRST_NAME A, ANON.ADDRESS B` — D-Shih, May 27 '22 at 12:38
@D-Shih We weren't aware this was considered a `CROSS JOIN` already. We just jammed some SQL commands together hoping it would work. I will update the question like 10 minutes with the sample data and expected result — Munchkin, May 27 '22 at 12:41
Your sample data can't align with your expected result. is there any column that can be `JOIN` or related to two tables? — D-Shih, May 27 '22 at 12:59
No need for SELECT DISTINCT here, your GROUP BY returns no duplicates. — jarlh, May 27 '22 at 13:06
There are basically two tables each with `val`s (values) and `oid`s. We can combine/join the values I think, whether it makes sense is another question. The thing we cannot simply use `SELECT` because `val` is ambigious. — Munchkin, May 27 '22 at 13:06
@jarlh apparently as you can see in the sample output there are a lot of duplicates in the screenshot shown. The addresses are bound to each name each and are repeated for each name making them completely not unique. We do this, because we assume that each person lives in a unique address (ofc multiple people can live at one address, but not vice versa) — Munchkin, May 27 '22 at 13:07
Tip of today: Switch to modern, explicit `JOIN` syntax. Easier to write (without errors), easier to read (and maintain), and easier to convert to outer join if needed. — jarlh, May 27 '22 at 13:10
@jarlh tried removing `SELECT DISTINCT` that produces errors and removing `DISTINCT` for some reason makes no difference for some reason — Munchkin, May 27 '22 at 13:11

score 1 · Answer 1 · answered May 30 '22 at 10:23

1

You are using select distinct wrong here.

To achieve the expected result you need to modify your query like so:

select distinct on (ADRESSE) * from bah;

That way your result rows will be unique.

answered May 30 '22 at 10:23

Pascal3366

91
6

Thanks, exactly what I wanted for the main part. – Munchkin May 31 '22 at 12:51
It's hugely inefficient to run `DISTINCT` after a misguided cross-join. For small tables it might not matter much, but this scales with O(N²). – Erwin Brandstetter May 31 '22 at 14:31
@Erwin Brandstetter how can this query be improved ? – Pascal3366 Jun 01 '22 at 10:36
@Pascal3366: Don't cross join to begin with. See my answer. – Erwin Brandstetter Jun 01 '22 at 11:07

Erwin Brandstetter · Accepted Answer · 2022-05-31T14:28:39.893

A CROSS JOIN is the wrong approach to begin with. Multiple FROM items separated by comma (,) are cross-joined. See:

What does [FROM x, y] mean in Postgres?

10 names in anon.first_name and 10 addresses in anon.address form a Cartesian product of 100 rows, with each name and address multiplied by the cardinality of the cross-joined table. Exactly what you don't want. All the confusion with duplicates was created by that. Applying DISTINCT after the cross join is hugely inefficient.

To use each name and address only once, attach a random (or arbitrary) number to each side and join on that.

Assuming (for lack of declaration) that each source table has distinct entries. (Else you need subqueries with DISTINCT to collapse dupes first - window functions like row_number() are applied before DISTINCT on the same query level.)

SELECT *
FROM  (
   SELECT row_number() OVER () AS arbitrary_nr, val AS vorname
   FROM   anon.first_name
   ) a
JOIN (
   SELECT row_number() OVER () AS arbitrary_nr, val AS adresse
   FROM   anon.address
   ) b USING (arbitrary_nr)

Combinations are arbitrary, rather than random. See:

How can I get a random cartesian product in PostgreSQL?

The result has the cardinality of the smaller table. Excess rows from the bigger table are skipped.

To allow some duplicates, you just introduce some duplicate numbers (arbitrary_nr) on one (or both) side(s), either with the same row multiple times, or different rows with the same arbitrary number. For your example:

There could be like 2 persons living at the same address sometimes

SELECT *
FROM  (
   SELECT row_number() OVER (ORDER BY val) AS arbitrary_nr, val AS vorname
   FROM   anon.first_name
   UNION ALL
   SELECT row_number() OVER (ORDER BY val) - 1, val
   FROM   anon.first_name
   WHERE  random() > .9
   ) a
JOIN (
   SELECT row_number() OVER () AS arbitrary_nr, val AS adresse
   FROM   anon.address
   ) b USING (arbitrary_nr)

This adds ~ 10 % of all names a second time. (So some names get two addresses.) This time, names are ordered alphabetically. The random sample starts with 0 (- 1) and numbers can can only increase slower, so the same name can never get the same arbitrary_nr, and the same address is never combined with the same name twice.

The result is still arbitrary (or random) as long as at least one side gets arbitrary (or random) numbers.

There are many ways. Much depends on your exact input, and exact requirements for the result.

Hide duplicate column cells in view

2 Answers2