
Working with Greenplum 6 (based on PostgreSQL 9.4), I have these sample tables:

Table SAMPLE_A:

| uniqueId | SampleSz |
| -------- | -------- |
| 1        | 25       |
| 2        | 50450    |
| 3        | 9        |

Table SAMPLE_B:

| IP       | uniqueId |
| -------- | -------- |
| 1.4.4.5  | (1,2,3)  |
| 2.5.6.7  | (2)      |
| 3.4.7.8  | (1,3)    |

I am trying to create a new table from the above:
For uniqueId = 1, pull a random set of 25 (SAMPLE_A.SampleSz) IPs (SAMPLE_B.IP) where SAMPLE_A.uniqueId is in the array SAMPLE_B.uniqueId. Then iterate to the next SAMPLE_A.uniqueId and pull a random 50450, etc.
Actual tables are a bit more complex, but I'm stuck here.
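
For reference, here is a minimal sketch of these tables as plain-PostgreSQL DDL (the column types are assumptions, and SAMPLE_B.uniqueId is taken to be an integer array):

CREATE TABLE sample_a (
    uniqueid int,
    samplesz int          -- how many IPs to sample for this id
);

CREATE TABLE sample_b (
    ip       inet,
    uniqueid int[]        -- the ids this IP belongs to
);

INSERT INTO sample_a VALUES (1, 25), (2, 50450), (3, 9);
INSERT INTO sample_b VALUES
    ('1.4.4.5', '{1,2,3}'),
    ('2.5.6.7', '{2}'),
    ('3.4.7.8', '{1,3}');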

I was trying to write this for a singular record (and failed):

select i.ip, s.uniqueId
from SAMPLE_A s
join lateral (
   select distinct ip
   from SAMPLE_B i
   where s.uniqueId = any(i.uniqueId)
-- ORDER BY random()
   LIMIT s.SampleSz
   ) i on true

This throws an unmarshalling error. Even if it worked, it wouldn't solve my full issue, but I figured it was a first step.

***** UPDATE 1: Desired Result Set *****

I was going to write out the result set as if the above tables were the entirety of the data set, but that really won't work (it would basically look as if I had just unnested the second table). So let's pretend we are working solely with EVERY IP between 3.4.7.0 and 3.4.7.255, and that each one has a record in SAMPLE_B with all 3 uniqueIds (1,2,3). For uniqueId 2, I would get all 256 IPs, since 256 is less than the associated sample size of 50,450. For uniqueId 1, I would get 25 "random" records. And for uniqueId 3, I would get something like below (9 random records). Obviously, 2 will overlap completely with 1 and 3; 1 and 3 could overlap each other on 0 to 9 records:

| IP       | uniqueId |
| -------- | -------- |
| 3.4.7.25 | 3        |
| 3.4.7.5  | 3        |
| 3.4.7.7  | 3        |
| 3.4.7.8  | 3        |
| 3.4.7.84 | 3        |
| 3.4.7.61 | 3        |
| 3.4.7.112| 3        |
| 3.4.7.125| 3        |
| 3.4.7.194| 3        |
| 3.4.7.207| 3        |
| 3.4.7.11 | 3        |
| 3.4.7.8  | 1        |
| 3.4.7.1  | 1        |
S Ayo

1 Answer


Your query works in my hands on Postgres 9.4.
The only real problem is ORDER BY random(): it cannot be combined with SELECT DISTINCT, because with DISTINCT the ORDER BY may only use expressions that appear in the SELECT list (Postgres raises "for SELECT DISTINCT, ORDER BY expressions must appear in select list").

SELECT b.ip, a.uniqueid
FROM   sample_a a
CROSS  JOIN LATERAL (
   SELECT DISTINCT b.ip
   FROM   sample_b b
   WHERE  b.uniqueid @> ARRAY[a.uniqueid]  -- array "contains" operator
   -- ORDER BY random()  -- not possible like this!!
   LIMIT  a.samplesz
   ) b;

I switched to the array operator because it can be supported with an index, at least in Postgres. Not sure where Greenplum stands on GIN indexes.
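
In stock PostgreSQL, a GIN index on the array column is what would support that operator (whether Greenplum 6 accepts and actually uses such an index is something you would have to verify):

CREATE INDEX sample_b_uniqueid_gin ON sample_b USING gin (uniqueid);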

If you need a random sample, you need to do more. A simple solution would be a subquery like:

SELECT b.ip, a.uniqueid
FROM   sample_a a
CROSS  JOIN LATERAL (
   SELECT *
   FROM (
      SELECT DISTINCT b.ip
      FROM   sample_b b
      WHERE  b.uniqueid @> ARRAY[a.uniqueid]  -- array "contains" operator
      ) b1
   ORDER  BY random()
   LIMIT  a.samplesz
   ) b;

But this performs poorly if there are many matches in sample_b. The best technique depends on cardinalities and your exact definition of "random". There are many shades of grey ...
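
One sketch of that "do more" direction, assuming you can supply a rough estimate of matching rows per uniqueid (the 2000000 constant below is a placeholder, not a measured number): discard most candidate rows with a cheap random() filter before the DISTINCT and the sort, oversample to compensate, and keep the final ORDER BY random() LIMIT as the hard cap. Note that the pre-filter slightly favors IPs that appear in many rows, so it only approximates a uniform sample:

SELECT b.ip, a.uniqueid
FROM   sample_a a
CROSS  JOIN LATERAL (
   SELECT *
   FROM (
      SELECT DISTINCT b.ip
      FROM   sample_b b
      WHERE  b.uniqueid @> ARRAY[a.uniqueid]
      AND    random() < least(1.0, a.samplesz * 10.0 / 2000000)  -- 2000000 = assumed matches per id, x10 oversampling
      ) b1
   ORDER  BY random()
   LIMIT  a.samplesz
   ) b;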

Erwin Brandstetter
  • The issue is that there are over 2B potentially matching records in sample_b (which actually contains a lot more records that I am eliminating for other mismatches not shown). As far as my definition of "random", I understand that random is not truly random, but I guess I'm looking for results in no discernible order; so long as the code or existing data structure is not influencing the result set, I'll accept it as random. – S Ayo Apr 19 '23 at 16:04
  • It's an interesting problem. The update sheds some light, more is missing. The solution can be optimized (dramatically) depending on the exact, complete situation. Either way, that's "above the pay grade" of a question on SO. Sounds more like consulting work. – Erwin Brandstetter Apr 23 '23 at 11:00