27

I have been experimenting with PostgreSQL and PL/V8, which embeds the V8 JavaScript engine into PostgreSQL. Using this, I can query into JSON data inside the database, which is rather awesome.

The basic approach is as follows:

CREATE OR REPLACE FUNCTION
  json_string(data json, key text) RETURNS text AS $$
  var data = JSON.parse(data);
  return data[key];
$$ LANGUAGE plv8 IMMUTABLE STRICT;

SELECT id, data FROM things WHERE json_string(data,'name') LIKE 'Z%';

Using V8, I can parse the JSON data into a JS object, return a field from it, and use the function as a regular pg query expression.

BUT

On large datasets, performance can be an issue, as the data has to be parsed for every row. The parser is fast, but it is definitely the slowest part of the process, and it has to happen every time.

What I am trying to work out (to finally get to an actual question) is whether there is a way to cache or pre-process the JSON. Even storing a binary representation of the JSON in the table that V8 could use directly as a JS object might be a win. I've had a look at using an alternative format such as MessagePack or Protocol Buffers, but I don't think they will necessarily be as fast as the native JSON parser in any case.
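
One idea I've sketched below (untested, and only a win when identical JSON values recur across rows) relies on plv8 keeping its JavaScript global scope alive for the lifetime of the session, so a global object can memoise parsed documents keyed by their raw text. The function name, cache name and size cap are just illustrative:

CREATE OR REPLACE FUNCTION
  json_string_cached(data json, key text) RETURNS text AS $$
  // jsonCache lives in plv8's global scope, so it survives between calls
  // within the same session (undeclared assignment creates a global).
  if (typeof jsonCache === 'undefined') {
    jsonCache = {};
    jsonCacheSize = 0;
  }
  var obj = jsonCache[data];
  if (obj === undefined) {
    obj = JSON.parse(data);
    if (jsonCacheSize < 10000) {  // crude cap so the cache cannot grow without bound
      jsonCache[data] = obj;
      jsonCacheSize++;
    }
  }
  return obj[key];
$$ LANGUAGE plv8 IMMUTABLE STRICT;

SELECT id, data FROM things WHERE json_string_cached(data, 'name') LIKE 'Z%';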

THOUGHT

PG has blobs and binary types, so the data could be stored in binary form; then we would just need a way to marshal this into V8.

Toby Hede
  • That should be `return data[key];`, right? – JMM May 25 '12 at 17:19
  • It sounds like what you really want is for Pg to store `json` fields natively in an efficient v8-compatible form that doesn't require parsing. Pg doesn't currently support this with the very new `json` field. If that's what you want, the first thing you need to do is research v8 to see if you can find support for high-performance binary serialisation and deserialisation of json, because if it doesn't have that you can't do what you want. – Craig Ringer May 28 '12 at 23:40
  • 2
    If performance is a concern, why don't you use JSON only for communication purposes and save the data into regular tables? – Gerardo Lima May 29 '12 at 09:48

5 Answers

12

Postgres supports indexes on arbitrary function calls. The following index should do the trick:

CREATE INDEX json_idx ON things (json_string(data,'name'));
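
One caveat worth adding (not from the original answer): a plain expression index covers equality lookups, but a prefix search with LIKE generally only uses a b-tree index built with text_pattern_ops (or a database using the C collation). A sketch, reusing the json_string() function and the things table from the question:

CREATE INDEX json_name_idx ON things (json_string(data,'name') text_pattern_ops);

SELECT id, data FROM things WHERE json_string(data,'name') LIKE 'Z%';
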
limscoder
  • That definitely works, and I am using it, but adding an index for every field can be problematic, and increasing the performance of non-indexed fields would enable support of ad-hoc queries. – Toby Hede May 27 '12 at 08:36
  • 1
    @TobyHede You'll be glad to know that there's work going on to unify hstore and json support, adding a hstore type that supports json representation, testing, indexing, etc. Hopefully in 9.4. – Craig Ringer Jun 13 '13 at 22:19
7

The short version appears to be that with Pg's new json support, so far there's no way to store json directly in any form other than serialised json text. (This looks likely to change in 9.4)

You seem to want to store a pre-parsed form that's a serialised representation of how v8 represents the json in memory, and that's not currently supported. It's not even clear that v8 offers any kind of binary serialisation/deserialisation of json structures. If it doesn't do so natively, code would need to be added to Pg to produce such a representation and to turn it back into v8 json data structures.

It also wouldn't necessarily be faster:

  • If json was stored in a v8-specific binary form, queries that returned the normal json representation to clients would have to format it each time it was returned, incurring CPU cost.

  • A binary serialised version of json isn't the same thing as storing the v8 json data structures directly in memory. You can't write a data structure that involves any kind of graph of pointers out to disk directly; it has to be serialised. This serialisation and deserialisation has a cost, and it might not even be much faster than parsing the json text representation. It depends a lot on how v8 represents JavaScript objects in memory.

  • The binary serialised representation could easily be bigger, since most json is text and small numbers, where you don't gain any compactness from a binary representation. Since storage size directly affects the speed of table scans, value fetches from TOAST, decompression time required for TOASTed values, index sizes, etc, you could easily end up with slower queries and bigger tables.

I'd be interested to see whether an optimisation like what you describe is possible, and whether it'd turn out to be an optimisation at all.

To gain the benefits you want when doing table scans, I guess what you really need is a format that can be traversed without having to parse it and turn it into what's probably a malloc()'d graph of javascript objects. You want to be able to give a path expression for a field and grab it out directly from the serialised form where it's been read into a Pg read buffer or into shared_buffers. That'd be a really interesting design project, but I'd be surprised if anything like it existed in v8.

What you really need to do is research how the existing json-based object databases do fast searches for arbitrary json paths and what their on-disk representations are, then report back on pgsql-hackers. Maybe there's something to be learned from people who've already solved this - presuming, of course, that they have.

In the meantime, what I'd focus on is what the other answers here are doing: working around the slow point and finding other ways to do what you need. You could also look into helping to optimise the json parser, but depending on whether the v8 parser or some other one is in use, that might already be far past the point of diminishing returns.

I guess this is one of the areas where there's a trade-off between speed and flexible data representation.

Craig Ringer
  • Thank you for this commentary. I'm stuck with the same problem and I now realize why storing a "binary version" of the JSON will not necessarily make things faster. – Saurabh Nanda Jun 13 '13 at 09:26
1

Perhaps, instead of making the retrieval phase responsible for parsing the data, a better approach would be to create a new data type which could pre-digest the JSON data on input?

http://www.postgresql.org/docs/9.2/static/sql-createtype.html
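
A rough sketch of the shape such a type could take (the names and MODULE_PATHNAME are purely hypothetical, not an existing extension; the input/output functions, where the pre-parsing would happen, would have to be written in C):

CREATE TYPE parsed_json;                      -- shell type, filled in below

CREATE FUNCTION parsed_json_in(cstring) RETURNS parsed_json
  AS 'MODULE_PATHNAME', 'parsed_json_in' LANGUAGE C IMMUTABLE STRICT;

CREATE FUNCTION parsed_json_out(parsed_json) RETURNS cstring
  AS 'MODULE_PATHNAME', 'parsed_json_out' LANGUAGE C IMMUTABLE STRICT;

CREATE TYPE parsed_json (
  INPUT          = parsed_json_in,
  OUTPUT         = parsed_json_out,
  INTERNALLENGTH = VARIABLE
);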

Jason Dwyer
  • But how would the JSON be stored in this case? The JSON datatype in PG9.2 is actually just a text field with a validation, so in order to use with V8 you need to parse the data. – Toby Hede May 20 '12 at 04:32
  • The problem is really that aspect ... how can I store the JSON in a native/binary "pre-parsed" format? – Toby Hede May 20 '12 at 04:35
  • I was thinking more along the lines of storing the indexable attributes in some internal data structure, so subsequent searches wouldn't operate on the JSON directly, but rather on the indexed content (denormalised, perhaps?). http://wiki.postgresql.org/wiki/XML_Support but don't expect me to have read it all :) Have you looked at the xml data type? http://www.postgresql.org/docs/current/static/datatype-xml.html Not sure how that is implemented under the hood, but something along the lines of mapping the JSON content to an SQL db is what I had in mind (yet managed not to mention above :) ) – Jason Dwyer May 20 '12 at 06:57
1

I don't have any experience with this, but it got me curious so I did some reading.

JSON only

What about something like the following (untested, BTW)? It doesn't address your question about storing a binary representation of the JSON; rather, it's an attempt to parse all of the JSON at once for all of the rows you're checking, in the hope that doing it in one pass will be faster than doing it individually for each row. If it succeeds at that, I'm thinking it may result in higher memory consumption, though.

The CREATE TYPE...set_of_records() stuff is adapted from the example on the wiki where it mentions that "You can also return records with an array of JSON." I guess it really means "an array of objects".

Is the id value from the DB record embedded in the JSON?

Version #1

CREATE TYPE rec AS (id integer, data text, name text);

CREATE FUNCTION set_of_records() RETURNS SETOF rec AS
$$
  var records = plv8.execute( "SELECT id, data FROM things" );
  var data = [];

  // Collect the raw JSON text of every row
  // (use a plain for loop instead if it performs better).
  records.forEach( function ( rec, i, arr ) {
    data.push( rec.data );
  } );

  // Parse all of the documents with a single JSON.parse() call.
  data = "[" + data.join( "," ) + "]";
  data = JSON.parse( data );

  // Copy the field we want to filter on back onto each record.
  records.forEach( function ( rec, i, arr ) {
    rec.name = data[ i ].name;
  } );

  return records;
$$
LANGUAGE plv8;

SELECT id, data FROM set_of_records() WHERE name LIKE 'Z%';

Version #2

This one gets Postgres to aggregate / concatenate some values to cut down on the processing done in JS.

CREATE TYPE rec AS (id integer, data text, name text);

CREATE FUNCTION set_of_records() RETURNS SETOF rec AS
$$
  // Note the trailing spaces inside the string literals; without them the
  // concatenated SQL would run the keywords together. The json column is
  // cast to text so string_agg() will accept it.
  var cols = plv8.execute(
    "SELECT " +
    "array_agg( id ORDER BY id ) AS id, " +
    "string_agg( data::text, ',' ORDER BY id ) AS data " +
    "FROM things"
  )[0];

  // Parse every document with a single JSON.parse() call.
  cols.data = JSON.parse( "[" + cols.data + "]" );

  var records = cols.id;

  // Turn the array of ids into an array of rec rows
  // (use a plain for loop instead if it performs better).
  records.forEach( function ( id, i, arr ) {
    arr[ i ] = {
      id   : id,
      // Re-serialise so the text column gets JSON text rather than "[object Object]".
      data : JSON.stringify( cols.data[ i ] ),
      name : cols.data[ i ].name
    };
  } );

  return records;
$$
LANGUAGE plv8;

SELECT id, data FROM set_of_records() WHERE name LIKE 'Z%';

hstore

How would the performance of this compare? Duplicate the JSON data into an hstore column at write time (or, if the performance somehow managed to be good enough, convert the JSON to hstore at select time) and use the hstore in your WHERE clause, e.g.:

SELECT id, data FROM things WHERE hstore_data -> 'name' LIKE 'Z%';
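
A sketch of what the write-time duplication could look like, assuming the hstore extension is installed and reusing the json_string() function from the question to backfill the single field being filtered on (new rows would presumably be populated by the application or an insert trigger):

CREATE EXTENSION IF NOT EXISTS hstore;

ALTER TABLE things ADD COLUMN hstore_data hstore;

-- Backfill existing rows with just the 'name' key for this example.
UPDATE things SET hstore_data = hstore('name', json_string(data, 'name'));

SELECT id, data FROM things WHERE hstore_data -> 'name' LIKE 'Z%';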

I heard about hstore from here: http://lwn.net/Articles/497069/

The article mentions some other interesting things:

PL/v8 lets you...create expression indexes on specific JSON elements and save them, giving you stored search indexes much like CouchDB's "views".

It doesn't elaborate on that and I don't really know what it's referring to.

There's a comment attributed to "jberkus" that says:

We discussed having a binary JSON type as well, but without a protocol to transmit binary values (BSON isn't at all a standard, and has some serious glitches), there didn't seem to be any point.

If you're interested in working on binary JSON support for PostgreSQL, we'd be interested in having you help out ...

JMM
  • hstore is pretty awesome, but doesn't have the same arbitrary depth as json – Toby Hede May 28 '12 at 06:41
  • "create expression indexes" is basically what I am doing, you can create an index on a function (detailed in @limscoder's response). – Toby Hede May 28 '12 at 06:42
  • @Toby "hstore is pretty awesome, but..." yeah, that's too bad, wasn't sure if that would be an issue for you. "you can create an index on a function." Yeah, I haven't used Postgres yet, so I didn't know you could do that until I saw limscoder's answer. That's pretty slick. I didn't connect the dots right away when I saw that, but later on figured out that's what the article I linked to was talking about. At first I thought it was talking about something specific to PL/V8. – JMM May 29 '12 at 02:10
  • Do you think there's anything to parsing the JSON all at once? Probably goes without saying, but (assuming the kind of setup I suggested basically works) I think the `set_of_records()` could be parameterized to limit the query and dynamically choose which elements to pull out of the JSON. – JMM May 29 '12 at 02:11
0

I don't know if it would be useful here, but I came across this: pg-to-json-serializer. It mentions functionality for:

parsing JSON strings and filling postgreSQL records/arrays from it

I don't know if it would offer any performance benefit over what you've been doing so far though, and I don't really even understand their examples.

Just thought it was worth mentioning.

JMM