
We have a Couchbase bucket that stores the customer data.

  • Each customer has exactly one document in this bucket.
  • Daily transactions result in updates to this customer data.

Sample document. Let's focus on the purchased_product_ids array.

{
  "customer_id" : 1000,
  "purchased_product_ids" : [1, 2, 3, 4, 5]
      // in reality this is a big array - hundreds of elements
  ...
  ... many other elements ...
  ...
}

Existing purchased_product_ids:
    [1, 2, 3, 4, 5]

Products purchased today:
    [1, 2, 3, 6]  // 6 is a new entry; the others already exist

Expected result after the update: 
    [1, 2, 3, 4, 5, 6]

I am using the Sub-Document API to avoid large data transfers between server and client.

Option1 "arrayAppend" :

customerBucket.mutateIn(customerKey)
    .arrayAppend("purchased_product_ids", JsonObject for [1,2,3,6] )
    .execute();

It results in duplicate elements. 
"purchased_product_ids" : [1, 2, 3, 4, 5, 1, 2, 3, 6]

Option2 "arrayAddUnique" :

customerBucket.mutateIn(customerKey)
    .arrayAddUnqiue("purchased_product_ids", 1 )
    .arrayAddUnqiue("purchased_product_ids", 2 )
    .arrayAddUnqiue("purchased_product_ids", 3 )
    .arrayAddUnqiue("purchased_product_ids", 6 )
    .execute();

It throws an exception most of the time, because those elements already exist.
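
The only workaround I see is one mutation per element, which costs a round trip per id. A minimal sketch, assuming the Java SDK 2.x, where (as far as I can tell) a single-spec mutateIn surfaces a duplicate as PathExistsException; addPurchases is a hypothetical helper name:

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.error.subdoc.PathExistsException;

// Hypothetical helper: add each product id on its own, swallowing duplicates.
// Note: one sub-document round trip per id.
static void addPurchases(Bucket customerBucket, String customerKey, int... productIds) {
    for (int id : productIds) {
        try {
            customerBucket.mutateIn(customerKey)
                    .arrayAddUnique("purchased_product_ids", id)
                    .execute();
        } catch (PathExistsException e) {
            // the id is already in the array - nothing to do
        }
    }
}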

Is there a better way to do this update?

ramu

2 Answers


Do you need purchased_product_ids to be ordered? If not, you can convert it to a map, e.g.

{
  "customer_id" : 1000,
  "purchased_product_ids" : {"1": {}, "3": {}, "5": {}, "2": {}, "4": {}}
}

and then write to that map with subdoc, knowing you won't be conflicting (assuming product IDs are unique):

customerBucket.mutateIn(customerKey)
    .upsert("purchased_product_ids.1", JsonObject.create()) // already exists
    .upsert("purchased_product_ids.6", JsonObject.create()) // new product
    .execute();

which will result in:

{
  "customer_id" : 1000,
  "purchased_product_ids" : {"1": {}, "3": {}, "6": {}, "5": {}, "2": {}, "4": {}}
}

(I've used JsonObject.create() as a placeholder here in case you need to associate additional information with each customer-order pair, but you could equally just write null. If you do need purchased_product_ids to be ordered, you can write the timestamp of the order, e.g. "1": {"date": <TIMESTAMP>}, and then order it in code when you fetch.)
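
A minimal sketch of that timestamp variant, using the same mutateIn/upsert as above ("date" and the hard-coded product ids 1 and 6 are just illustrative):

import com.couchbase.client.java.document.json.JsonObject;

// Record the purchase time under each product id, so the map can be
// re-ordered client-side after a fetch.
long now = System.currentTimeMillis();
customerBucket.mutateIn(customerKey)
    .upsert("purchased_product_ids.1", JsonObject.create().put("date", now))
    .upsert("purchased_product_ids.6", JsonObject.create().put("date", now))
    .execute();

On read, sort the map entries by their date field to recover purchase order.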

Graham Pople
  • So you suggest using JSON dictionaries instead of JSON arrays. This solves my data update problem. Just one question though: can we UNNEST dictionaries in N1QL like we do arrays? (Most of our query patterns use N1QL UNNEST.) Please help if there is a way to do that with dictionaries; otherwise this data will not be very useful when we query it later. – ramu Nov 28 '18 at 14:39
  • I'm not certain. Perhaps, instead of changing the data structure, it would be easiest to avoid subdoc and simply get() and replace() the document, using CAS to detect any concurrency issues? (See the sketch after these comments.) – Graham Pople Nov 28 '18 at 15:12
  • Graham, in my case the customer document is much bigger than the transaction, so "get, process and replace" would move a much larger amount of data (the whole customer document) twice for every transaction. The change I want to push is small, which is why I went for subdoc. I will try the N1QL approach suggested by Johan. – ramu Nov 29 '18 at 01:13
  • True, but if you know the document ID then it may still be faster (though admittedly higher bandwidth) to do a full document SDK 'get and replace' than the N1QL query, even with the two network round trips, as your app can send the requests directly to the correct data node, plus there's no index or query parsing involved. It will also reduce load on the query node(s). If this is a mission-critical part of the app and latency is crucial then it could be worth benchmarking both approaches. – Graham Pople Nov 29 '18 at 18:05
  • You are correct. I need to check how each option behaves with a considerable load in the system; with few records it may be misleading. – ramu Nov 30 '18 at 07:25
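
For reference, a sketch of the full-document get()/replace() alternative discussed in the comments above, assuming the Java SDK 2.x, where replace() performs a CAS check using the CAS value carried by the fetched document (the retry bound of 3 and the helper name mergePurchases are arbitrary choices):

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonArray;
import com.couchbase.client.java.error.CASMismatchException;

// Read the whole document, merge the new ids, write it back with CAS.
static void mergePurchases(Bucket customerBucket, String customerKey, int... newIds) {
    for (int attempt = 0; attempt < 3; attempt++) {
        JsonDocument doc = customerBucket.get(customerKey);
        JsonArray ids = doc.content().getArray("purchased_product_ids");
        for (int id : newIds) {
            if (!ids.toList().contains(id)) { // de-duplicate before writing
                ids.add(id);
            }
        }
        try {
            customerBucket.replace(doc);      // CAS mismatch => concurrent writer
            return;
        } catch (CASMismatchException e) {
            // another writer got there first; re-read and retry
        }
    }
    throw new IllegalStateException("could not update " + customerKey);
}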

You could use N1QL with the ARRAY_APPEND() and ARRAY_DISTINCT() functions.

UPDATE customer USE KEYS "foo" 
SET purchased_product_ids = ARRAY_DISTINCT(ARRAY_APPEND(purchased_product_ids, 9))

Presumably this would be a prepared statement and the key itself and the new value would be supplied as parameters.
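
Via the Java SDK that might look like the following sketch ($key and $pid are illustrative parameter names; adhoc(false) asks the SDK to prepare the statement and reuse the plan on later executions):

import com.couchbase.client.java.document.json.JsonObject;
import com.couchbase.client.java.query.N1qlParams;
import com.couchbase.client.java.query.N1qlQuery;

// Parameterized, prepared N1QL update executed through the SDK.
String stmt = "UPDATE customer USE KEYS $key "
    + "SET purchased_product_ids = "
    + "ARRAY_DISTINCT(ARRAY_APPEND(purchased_product_ids, $pid))";
JsonObject params = JsonObject.create().put("key", "foo").put("pid", 9);
customerBucket.query(
    N1qlQuery.parameterized(stmt, params, N1qlParams.build().adhoc(false)));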

Also, if you want to add multiple elements to the array at once, ARRAY_CONCAT() would be a better choice. More here:

https://docs.couchbase.com/server/6.0/n1ql/n1ql-language-reference/arrayfun.html
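
For the multi-element case, a sketch of the ARRAY_CONCAT() form, again wrapped in ARRAY_DISTINCT() to drop duplicates ($key and $new_ids are illustrative parameter names):

import com.couchbase.client.java.document.json.JsonArray;
import com.couchbase.client.java.document.json.JsonObject;
import com.couchbase.client.java.query.N1qlQuery;

// Append several ids in one statement; ARRAY_DISTINCT removes any repeats.
String stmt = "UPDATE customer USE KEYS $key "
    + "SET purchased_product_ids = "
    + "ARRAY_DISTINCT(ARRAY_CONCAT(purchased_product_ids, $new_ids))";
JsonObject params = JsonObject.create()
    .put("key", "foo")
    .put("new_ids", JsonArray.from(1, 2, 3, 6));
customerBucket.query(N1qlQuery.parameterized(stmt, params));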

Johan Larson
  • Thanks Johan. It seems simple if I go the N1QL route instead of the SDK APIs. If I use N1QL instead of the SDK, am I going to miss anything in performance or functionality? Does it avoid server-to-client round-trip data transfers the way the Sub-Document API does? – ramu Nov 28 '18 at 14:48
  • The document will not travel to the client if you launch a query of the form I use above. The execution happens in the query engine. The query engine will need to pull the document from whatever server it is stored on and write it back there, though, but that happens intra-cluster. – Johan Larson Nov 28 '18 at 15:44
  • If you are using N1QL, you can also consider ARRAY_PUT(). https://docs.couchbase.com/server/6.0/n1ql/n1ql-language-reference/arrayfun.html – vsr Nov 28 '18 at 16:43
  • Generally speaking the SDK API will be faster than the N1QL route. – Paddy Nov 29 '18 at 14:57