
I was playing around with PostgreSQL 9.3 and decided to try out the JSON type. I wrote a query that counts certain JSON data and groups it by user_id, but the query took very long to finish. Could somebody give me a hint on how to improve the performance?

But let me show you the test data and the relations first:

I downloaded some test data which contains one JSON document per line. Each JSON document is composed of an array named "hashtags", an array named "urls", and a field named "user_id" (non-unique).

Test data:

{"hashtags": ["a", "b", "c"],"user_id": 210824047, "urls": ["http://www.google.com"]}
{"hashtags": ["b"], "user_id": 78149438, "urls": ["http://www.facebook.com"]}
{"hashtags": ["a", "c"], "user_id": 123791448, "urls": ["http://www.twitter.com","http://www.facebook.com"]}
...

Then I created a simple relation with an auto-incrementing primary key and a JSON column like this:

CREATE TABLE tweets(id BIGSERIAL PRIMARY KEY, tweet JSON);
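For reference, individual rows like the test data above could be inserted like this (just an illustrative sketch; I actually bulk-loaded the whole file):

    INSERT INTO tweets (tweet) VALUES
      ('{"hashtags": ["a", "b", "c"], "user_id": 210824047, "urls": ["http://www.google.com"]}'),
      ('{"hashtags": ["b"], "user_id": 78149438, "urls": ["http://www.facebook.com"]}'),
      ('{"hashtags": ["a", "c"], "user_id": 123791448, "urls": ["http://www.twitter.com", "http://www.facebook.com"]}');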

I loaded the data into PostgreSQL, so that one JSON document now equals one row:

id | tweet
---+---------------------------------------------------------------------------------
1  |{"hashtags": ["a", "b", "c"],"user_id": 210824047, "urls": ["http://www.google.com"]}
2  |{"hashtags": ["b"], "user_id": 78149438, "urls": ["http://www.facebook.com"]}
3  |{"hashtags": ["a", "c"], "user_id": 123791448, "urls": ["http://www.twitter.com","http://www.facebook.com"]}

Now I wanted to know how often a hashtag was used by a certain user_id. To do that I had to use a sub-query with the json_array_elements() function, which expands a JSON array into a set of rows, one per element.
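For example, applied to a literal array it behaves like this (just an illustration of the function, not part of my actual query):

    SELECT json_array_elements('["a", "b", "c"]'::json);
    -- json_array_elements
    -- ---------------------
    --  "a"
    --  "b"
    --  "c"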

The query looks like this:

    select foo.uid, foo.tag, count(foo.*)
    from (
        select (tweet::json->>'user_id')::int as uid,
               json_array_elements(tweet::json->'hashtags')::text as tag
        from tweets
    ) as foo
    group by foo.uid, foo.tag;

I got the result I wanted, but the query took very long to finish. For example, I loaded 200,000 JSON documents into PostgreSQL (slightly less than 25 MB), and it took around 10 minutes to get a result. I also tried various indexes, but the performance only improved by about half a minute at best.
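For reference, the plan and timing of the query above can be inspected with EXPLAIN (ANALYZE); this is just the standard syntax, I have not included my plan output here:

    EXPLAIN (ANALYZE, BUFFERS)
    select foo.uid, foo.tag, count(foo.*)
    from (
        select (tweet::json->>'user_id')::int as uid,
               json_array_elements(tweet::json->'hashtags')::text as tag
        from tweets
    ) as foo
    group by foo.uid, foo.tag;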

Now, being a PostgreSQL beginner, I ask myself: is it really that slow, or did I do something wrong? If it is the latter, what can I do about it?

Your help will be much appreciated.

  • It would be better to store data with a known structure in regular fields. You can try a functional index, but JSON is not a good way to store data anyway; it uses more disk space and is harder to query: http://stackoverflow.com/questions/17807030/how-to-create-index-on-json-field-in-postgres-9-3 – jbaliuka Feb 13 '14 at 18:18
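A minimal sketch of the kind of expression index the comment refers to (the index name and the choice of user_id are assumptions, not from the original post):

    -- Expression index on the extracted user_id; only useful if queries
    -- filter or group on (tweet->>'user_id')::int. Names are illustrative.
    CREATE INDEX tweets_user_id_idx ON tweets (((tweet->>'user_id')::int));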

1 Answer


JSON is still pretty new to PostgreSQL and isn't terribly well optimized yet, but is improving rapidly.

When 9.4 is released later this year, it looks like there will be a huge improvement in this area (I get about 2 seconds for your query with 196,608 records).

Actually that improvement is also in 9.3.3, which should be out in a week.

jjanes