Working with Greenplum 6 (based on PostgreSQL 9.4.), I have these sample tables:
Table SAMPLE_A
:
| uniqueId | SampleSz |
| -------- | -------- |
| 1 | 25 |
| 2 | 50450 |
| 3 | 9 |
Table SAMPLE_B
:
| IP | uniqueId |
| -------- | -------- |
| 1.4.4.5 | (1,2,3) |
| 2.5.6.7 | (2) |
| 3.4.7.8 | (1,3) |
I am trying to create a new table from the above:
For uniqueId=1
pull a random set of 25 (SAMPLE_A.SampleSz
) IPs (SAMPLE_B.IP
) where SAMPLE_A.uniqueId
is in the array of SAMPLE_B.uniqueId
. Then iterate to the next SAMPLE_A.uniqueId
and pull random 50450 ... etc.
Actual tables are a bit more complex, but I'm stuck here.
I was trying to write this for a singular record (and failed):
select i.ip, s.uniqueId
from SAMPLE_A s
join lateral (
select distinct ip
from SAMPLE_B i
where s.uniqueId = any(i.uniqueId)
-- ORDER BY random()
LIMIT s.SampleSz
) i on true
This throws an unmarshalling error. Even if it worked, it wouldn't solve my full issue, but I figured it was a first step.
***** UPDATE 1: Desired Result Set ***** I was going to do the result set as if the above tables are the entirety of the data set, but that really won't work (as it would show basically as if I just unnested the second table). So let's pretend that we are working (solely) with EVERY IP between 3.4.7.0 and 3.4.7.255; each has a record in SAMPLE_B with all 3 uniqueIds (1,2,3). For UniqueId 2, I would get the all 256 IPs (as it is less than the associated sample size (50,450). I would get 25 "random" records for 1. And for 3, I would get something like below (9 random records). Obviously, 2 will overlap completely (with 1&3); 1 & 3 could overlap each other on 0 to 9 records:
| IP | uniqueId |
| -------- | -------- |
| 3.4.7.25 | 3 |
| 3.4.7.5 | 3 |
| 3.4.7.7 | 3 |
| 3.4.7.8 | 3 |
| 3.4.7.84 | 3 |
| 3.4.7.61 | 3 |
| 3.4.7.112| 3 |
| 3.4.7.125| 3 |
| 3.4.7.194| 3 |
| 3.4.7.207| 3 |
| 3.4.7.11 | 3 |
| 3.4.7.8 | 1 |
| 3.4.7.1 | 1 |