
There are two tables:

Authorized Contacts (auth_contacts):

(
userid varchar
contacts jsonb
)

contacts contains an array of contacts with attributes {contact_id, type}

discussion:

(
contact_id varchar
discussion_id varchar
discussion_details jsonb
)
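For reference, the two tables can be sketched as DDL (a sketch only; the constraints and exact types are assumptions, not part of the original schema):

```sql
CREATE TABLE auth_contacts (
    userid   varchar PRIMARY KEY,  -- assumed: one row per user
    contacts jsonb                 -- {"contact": [{"type": ..., "contact_id": ...}, ...]}
);

CREATE TABLE discussion (
    contact_id         varchar,
    discussion_id      varchar,
    discussion_details jsonb
);
```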

The table auth_contacts has at least 100k records. Making contacts a non-JSONB (relational) structure did not seem appropriate, as it would double or triple the number of records.

Sample data for auth_contacts:

userid  | contacts
'11111' | '{"contact": [{"type": "type_a", "contact_id": "1-A-12"}
                      , {"type": "type_b", "contact_id": "1-A-13"}]}'

The discussion table has roughly 5 million records.

I want to join discussion.contact_id (a relational column) with contact_id, which is an attribute of a JSON object inside the array of JSON objects in auth_contacts.contacts.

One very crude way is:

SELECT *
FROM discussion d 
JOIN (SELECT userid, JSONB_OBJECT_KEYS(a.contacts) AS auth_contact
      FROM auth_contacts a) AS contacts
      ON (d.contact_id = contacts.auth_contact::text)

What this does is create a userid-vs-contact_id table at runtime (the inner query), which is exactly what I was trying to avoid by choosing the JSONB data type. For a user with many records this query takes 26+ seconds, which is not good at all. I tried a few other approaches, e.g.: PostgreSQL 9.4: Aggregate / Join table on JSON field id inside array

But there should be a cleaner and better way, something as simple as a JOIN on d.contact_id = contacts -> 'contact' -> 'contact_id'? When I try this, it doesn't yield any results.

Searching the net, this seems to be a pretty cumbersome task.

Prachi Tripathi
  • If you try to use `contacts->'contact'->'contact_id'` it will not get any values since `contact` is an array. Technically you could use something like `contacts->'contact'@>'[{"contact_id":"1-A-12"}]'` but for joining based on that I'm not sure it would be that efficient. – Sami Kuhmonen Jul 09 '15 at 06:49
  • contacts->'contact'->0->'contact_id', contacts->'contact'->1->'contact_id' returns the subsequent values. As for `@>`, it actually checks a containment condition (whether the left operand contains the right) and cannot be used for a join, as far as I know. Would it be possible? – Prachi Tripathi Jul 09 '15 at 07:07
  • It is possible to do an ugly `SELECT * FROM discussion d JOIN auth a ON a.contacts->'contact'@>('[{"contact_id":"'||d.contact_id||'"}]')::jsonb` based on a quick test, but I wouldn't expect it to be that performant. Maybe using `jsonb_array_elements()` would be better, but it still has to explode the values out of the JSON anyway. – Sami Kuhmonen Jul 09 '15 at 07:14
  • Agreed yes. But as you mentioned, performance wise it is a killer. I am thinking there should be a quicker and cleaner way like it is in MongoDB? – Prachi Tripathi Jul 09 '15 at 12:26
  • What is a `lakh record`? Your "crude way" is invalid syntax. I am also confused by your cardinalities: 5 MM in `discussion`, how many in `auth_contacts`? – Erwin Brandstetter Jul 10 '15 at 05:59
  • @SamiKuhmonen: Supported by a matching index, the contains operator should actually perform very well for big tables. – Erwin Brandstetter Jul 10 '15 at 06:26
  • @ErwinBrandstetter Yes, it would, but in this case my ugly way would generate a jsonb object dynamically for every row and then search for that, I'm pretty sure that will be a huge performance killer, or would it be fast? If it was used with a static jsonb, it would be usable. – Sami Kuhmonen Jul 10 '15 at 06:27
  • 2
    @SamiKuhmonen: The index is used on the `auth_contacts` side. For instance to look up auth_user for a given discussion. For the other direction, you would unnest an then join utilizing the PK index on `discussion.contact_id`. But this is all academic talk. The real solution is a sane data model. – Erwin Brandstetter Jul 10 '15 at 06:34
  • @ErwinBrandstetter why do you keep commenting that using json/jsonb for foreign keys is "insane"? it seems postgres either does JSON/JSONB or it doesn't, and if it can't use a foreign key in JSON/JSONB then it's a shortcoming of postgres. I am just ramping up on it, maybe I'm asking too much of it...? – moodboom Jan 27 '19 at 22:47

1 Answer


Proof of concept

Your "crude way" doesn't actually work. Here is another crude way that does:

SELECT *
FROM  auth_contacts a
    , jsonb_to_recordset(a.contacts->'contact') AS c(contact_id text)
JOIN  discussion d USING (contact_id);

As has been commented, you can also formulate a join condition with the contains operator @>:

SELECT *
FROM   auth_contacts a
JOIN   discussion d ON a.contacts->'contact'
                    @> json_build_array(json_build_object('contact_id', d.contact_id))::jsonb

But rather use JSON creation functions than string concatenation. It looks cumbersome, but it is actually very fast when supported by a functional jsonb_path_ops GIN index:

CREATE INDEX auth_contacts_contacts_gin_idx ON auth_contacts
USING  gin ((contacts->'contact') jsonb_path_ops);
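To illustrate (a sketch; the literal contact_id is just an example value), a lookup of all users authorized for one known contact can use this index directly:

```sql
-- The @> predicate on (contacts->'contact') matches the indexed expression,
-- so Postgres can answer this with the jsonb_path_ops GIN index.
SELECT a.userid
FROM   auth_contacts a
WHERE  a.contacts->'contact' @> '[{"contact_id": "1-A-12"}]'::jsonb;
```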


Proper solution

This is all fascinating to play with, but the problem here is the relational model. Your claim:

Making contacts a non-JSONB (relational) structure did not seem appropriate, as it would double or triple the number of records.

is the opposite of what's right. It's nonsense to wrap IDs you need for joining tables into a JSON document type. Normalize your tables with a many-to-many relationship and implement all IDs you are working with inside the DB as separate columns with appropriate data types.
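A normalized layout could look like this (a sketch; table and column names are only suggestions, not taken from the original schema):

```sql
-- One row per (user, authorized contact) pair instead of a JSONB array.
CREATE TABLE auth_contact (
    userid     varchar NOT NULL,
    contact_id varchar NOT NULL,
    type       text    NOT NULL,
    PRIMARY KEY (userid, contact_id)
);

-- The problematic join then becomes a plain, indexable equi-join:
SELECT *
FROM   auth_contact a
JOIN   discussion  d USING (contact_id);
```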

Erwin Brandstetter
  • @PrachiTripathi: `jsonb_object_keys()` makes sense for that. But you also have a plain syntax error. `c.contacts` should probably be `a.contacts`. – Erwin Brandstetter Jul 10 '15 at 09:34
  • Apologies - my crude way works where I store it as follows: {"3-1LS-86": "type_a", "3-1IC-1432": "type_b"} and not as an array, hence the subquery over the keys works. About the proper solution: agreed, it should be relational, but consider the auth_contacts table: the number of users we are looking at could be a hundred thousand (100000+), per user the authorized contacts average around 6000, and some thousands of users have 60k+ authorized contacts. Rows in auth_contacts would then be 500 million+ records, which is why I felt a join of this with a timeline of 1 million records would kill the system, wouldn't it? – Prachi Tripathi Jul 10 '15 at 09:37
  • Also it's a one-to-many relationship between user and authorized contacts, not many-to-many (userid, auth_contacts) – Prachi Tripathi Jul 10 '15 at 09:38
  • Looks like an n:m relationship between `auth_contacts` and `discussion`. But I don't know the details. If you expect that many rows, be sure to use integer (or bigint if you must) for ID columns and not string types. – Erwin Brandstetter Jul 10 '15 at 10:17
  • Yes, using IDs is OK. But I am still wondering: searching online suggests Postgres JSONB is faster than MongoDB, yet the join as given in the example takes 28 seconds for a user who has 70k+ contacts – Prachi Tripathi Jul 14 '15 at 12:32
  • @Prachi: There are *certainly* ways to make this faster, even with your suboptimal current db design. Best start a *new question* so as not to overload this one. Provide essential information: the exact table definition, which columns you need to return (you don't need `SELECT *`, do you?), which indexes you created, and the output of `EXPLAIN (ANALYZE, BUFFERS)` - best post to explain.depesz.com and post the link. Consider the instructions in the **[tag info to `[postgresql-performance]`](http://stackoverflow.com/tags/postgresql-performance/info)**. You can always link to this question for context. – Erwin Brandstetter Jul 14 '15 at 13:16