
I have the following query, which extracts several columns of data out of a JSON object (the facebook_results column, of json datatype, in Postgres 10).

Sometimes the arrays in this object contain 10,000+ items.

The goal is to get a flat, denormalized map of every column in the object; where there's an array, I want all the columns of the objects contained within it too (duplicating the outer keys' data down for each row).

None of the innermost keys contain arrays, so I don't need to worry about that. I only care about the matches and nodes arrays, which should be 'expanded'.

Right now the query works, but it's extremely slow. I assume that's because the query is poorly written: perhaps it re-evaluates the arrays repeatedly, or has some unnecessary complexity slowing it down.

SELECT
  id AS slice_id,
  json_array_elements(facebook_results -> 'table' -> 'matches') -> 'table' -> 'size'       AS match_size,
  json_array_elements(facebook_results -> 'table' -> 'matches') -> 'table' -> 'score'      AS match_score,
  json_array_elements(facebook_results -> 'table' -> 'matches') -> 'table' -> 'width'      AS match_width,
  json_array_elements(facebook_results -> 'table' -> 'matches') -> 'table' -> 'format'     AS match_format,
  json_array_elements(facebook_results -> 'table' -> 'matches') -> 'table' -> 'domain'     AS match_domain,
  json_array_elements(json_array_elements(facebook_results -> 'table' -> 'matches') -> 'table' -> 'nodes') -> 'table' -> 'crawl_date' AS node_crawl_date,
  json_array_elements(json_array_elements(facebook_results -> 'table' -> 'matches') -> 'table' -> 'nodes') -> 'table' -> 'url'        AS node_url
FROM slices
WHERE id = 169

Here's an example of what's contained in the facebook_results column:

{
  "table":{
    "matches": [
      {  
        "table":{  
          "nodes":[  
            {  
              "table":{  
                "crawl_date":"2013-06-21",
                "url":"http://example.com"
              }
            }
          ],
          "size":7962624,
          "score":47.059,
          "width":3456,
          "format":"MP4",
          "domain":"example.com"
        }
      }
    ]
  }
}

Does anyone have an idea how I could optimize this?

Tallboy

1 Answer


You could rewrite your query using LATERAL:

SELECT
  id AS slice_id,
  s.t -> 'size'       AS match_size,
  s.t -> 'score'      AS match_score,
  s.t -> 'width'      AS match_width,
  s.t -> 'format'     AS match_format,
  s.t -> 'domain'     AS match_domain,
  s.t2-> 'crawl_date' AS node_crawl_date,
  s.t2-> 'url'        AS node_url
FROM slices,
LATERAL (
  SELECT json_array_elements(facebook_results -> 'table' -> 'matches') -> 'table',
         json_array_elements(json_array_elements(facebook_results -> 'table' -> 'matches')
                 -> 'table' -> 'nodes') -> 'table'
) s(t, t2)
WHERE id = 169;

DBFiddle Demo

Or even shorter:

SELECT
  id AS slice_id,
  s.t   -> 'size'       AS match_size,
  s.t   -> 'score'      AS match_score,
  s.t   -> 'width'      AS match_width,
  s.t   -> 'format'     AS match_format,
  s.t   -> 'domain'     AS match_domain,
  s2.t2 -> 'crawl_date' AS node_crawl_date,
  s2.t2 -> 'url'        AS node_url
FROM slices,
LATERAL (SELECT json_array_elements(facebook_results -> 'table' -> 'matches') -> 'table') s(t),
LATERAL (SELECT json_array_elements(s.t -> 'nodes') -> 'table') s2(t2)
WHERE id = 169;

DBFiddle Demo2
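If the column can be stored as (or cast to) jsonb, the jsonb variants of these functions are generally cheaper, since the json type is stored as text and reparsed on every access. A sketch of the same query using a cast (the cast itself, and whether it pays off for your data, are assumptions to verify):

```sql
SELECT
  id AS slice_id,
  m.t  -> 'size'       AS match_size,
  m.t  -> 'score'      AS match_score,
  m.t  -> 'width'      AS match_width,
  m.t  -> 'format'     AS match_format,
  m.t  -> 'domain'     AS match_domain,
  n.t2 -> 'crawl_date' AS node_crawl_date,
  n.t2 -> 'url'        AS node_url
FROM slices
-- expand each match object once, instead of once per referencing column
CROSS JOIN LATERAL (
  SELECT jsonb_array_elements(facebook_results::jsonb -> 'table' -> 'matches') -> 'table'
) m(t)
-- then expand that match's nodes
CROSS JOIN LATERAL (
  SELECT jsonb_array_elements(m.t -> 'nodes') -> 'table'
) n(t2)
WHERE id = 169;
```

Note that casting on the fly still reparses the text each time the row is read; converting the column itself to jsonb avoids that entirely.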

Lukasz Szozda
  • That's incredible. I have read the definition of lateral join several times, it seems it makes a cartesian product but I still can't quite conceptualize it vs something simple like an inner or left outer join. Do you think you could explain a bit how this amazing query works? – Tallboy Aug 28 '18 at 07:01
  • By the way it seems that it generates about 1000 rows in 60 seconds, but in ruby I can loop through all the same arrays in mere seconds. Is this query still doing something recursive? I will say that the arrays have thousands of `matches` but there should only be 1-5 `nodes`. I can tell it's not going slow because of the byte size of the data (it's not like I'm dumping MB of text back). It just seems to be really slow parsing it. – Tallboy Aug 28 '18 at 07:03
  • @Tallboy You could conceptualize LATERAL as correlated subquery. I used it when I want to avoid repeating myself [PostgreSQL: using a calculated column in the same query](https://stackoverflow.com/questions/8840228/postgresql-using-a-calculated-column-in-the-same-query/36530228#36530228). As for parse time you could create index on top of it. – Lukasz Szozda Aug 28 '18 at 14:39
  • oh damn!!! I didn't know you could create an index like that. Holy crap that is cool – Tallboy Aug 28 '18 at 15:37
  • How exactly is that index working, I'm only familiar with a btree and also multi-column indexes. "What" is it actually indexing? Isn't the array already O(1) access time, so not able to be improved? – Tallboy Aug 28 '18 at 15:39
  • I get this error `PostgreSQL said: set-returning functions are not allowed in index expressions` – Tallboy Aug 28 '18 at 15:59
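That last error is expected: PostgreSQL does not allow set-returning functions such as json_array_elements in index expressions. One workaround (a sketch, not from the original thread) is to materialize the flattened rows once and index the result instead:

```sql
-- Precompute the expanded matches; refresh when slices changes.
CREATE MATERIALIZED VIEW slice_matches AS
SELECT
  id AS slice_id,
  json_array_elements(facebook_results -> 'table' -> 'matches') -> 'table' AS match
FROM slices;

-- Plain btree index on the scalar key used for lookups.
CREATE INDEX ON slice_matches (slice_id);
```

Queries then read from slice_matches by slice_id without reparsing the JSON, at the cost of running REFRESH MATERIALIZED VIEW when the underlying data changes.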