Built-in way to read CouchDB document size?

Question

I'm experimenting with using CouchDB as a message store and would like to report the message size.

Ideally it would be nice to read a _size attribute. At worst I could check the string length of the entire document's JSON. I may even want to use the size as a view key.

What do you think is the best way to record document size and why do you think that method is best?

score 9 · Accepted Answer · answered Jan 14 '12 at 19:04


You could make a view:

function (doc) {
    emit(doc._id, JSON.stringify(doc).length);
}
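
If attachment sizes should be reported as well, one possible extension is sketched below. It assumes the document passed to the map function carries `_attachments` stubs with a numeric `length` field (the attachment size in bytes); the helper name `sizeWithAttachments` is made up for this illustration. The two numbers are kept separate because the JSON length counts UTF-16 code units while the attachment lengths are bytes.

```javascript
// Hypothetical helper: reports the JSON length of a document and the
// summed byte length of its attachment stubs as two separate numbers.
function sizeWithAttachments(doc) {
    var attachments = doc._attachments || {};
    var attachmentBytes = 0;
    for (var name in attachments) {
        if (Object.prototype.hasOwnProperty.call(attachments, name)) {
            // Assumption: each stub has a numeric `length` field.
            attachmentBytes += attachments[name].length || 0;
        }
    }
    return {
        jsonLength: JSON.stringify(doc).length, // UTF-16 code units, not bytes
        attachmentBytes: attachmentBytes
    };
}

// Inside a design document the helper body would have to be inlined
// into the map function, for example:
//   function (doc) { emit(doc._id, sizeWithAttachments(doc)); }
```

Note that `JSON.stringify(doc)` only sees the attachment stubs (metadata), not the attachment bodies.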


Robert Newson


any way of including sizes of attachments in this? – Eduardo Scoz Nov 16 '12 at 20:49
Warning: this is NOT the length in bytes. This is the [number of UTF-16 code units](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/length). – Marcello Nuccio Jun 24 '16 at 07:35
Does this `length` value include `._attachments`? – Armand Feb 21 '17 at 05:35

Marcello Nuccio · Answer 2 · 2017-03-03T14:25:22.660

5

You can make a HEAD request:

$ curl -I http://USER:PASS@localhost:5984/db/doc_id
HTTP/1.1 200 OK
Server: CouchDB/1.1.1 (Erlang OTP/R14B03)
Etag: "1-c0b6a87a64fa1b1f63ee2aa7828a5390"
Date: Tue, 17 Jan 2012 21:32:43 GMT
Content-Type: text/plain;charset=utf-8
Content-Length: 740047
Cache-Control: must-revalidate

The Content-Length header contains the length in bytes of the document. This is very fast because you don't need to download the full document.

But there's a caveat: Content-Length is the number of bytes of the utf-8 version of the document (see the Content-Type header); String.length is the number of 16-bit utf-16 code units in a string.

That is, they count different things (bytes versus code units) in different encodings of the document (UTF-8 versus UTF-16).
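
The difference between the two counts can be illustrated with a small sketch. The function name `utf8ByteLength` is made up for this example; it walks the UTF-16 code units of a string and adds up how many bytes each character would take in UTF-8. It is written without `TextEncoder` or `Buffer`, on the assumption that those may not be available in every JavaScript environment (such as an older view server).

```javascript
// Hypothetical helper: number of bytes a string occupies when encoded as UTF-8.
function utf8ByteLength(str) {
    var bytes = 0;
    for (var i = 0; i < str.length; i++) {
        var unit = str.charCodeAt(i);
        if (unit < 0x80) {
            bytes += 1;                      // ASCII
        } else if (unit < 0x800) {
            bytes += 2;
        } else if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length &&
                   str.charCodeAt(i + 1) >= 0xDC00 && str.charCodeAt(i + 1) <= 0xDFFF) {
            bytes += 4;                      // surrogate pair: one code point, two code units
            i++;                             // skip the low surrogate
        } else {
            bytes += 3;                      // rest of the BMP (a lone surrogate is counted as 3 bytes)
        }
    }
    return bytes;
}

var sample = "caf\u00e9 \u20ac \ud83d\ude00"; // "café € 😀"
var codeUnits = sample.length;                // what String.length reports
var utf8Bytes = utf8ByteLength(sample);       // closer to what Content-Length measures
```

Even with a byte count like this, the result for `JSON.stringify(doc)` may still differ slightly from the `Content-Length` header, since the server produces its own serialization of the document.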

edited Mar 03 '17 at 14:25

answered Jan 17 '12 at 21:37

Marcello Nuccio


This is also a good answer. Since I want to be able to use the size in an index, I prefer the other answer this time. – Sir Wobin Jan 23 '12 at 14:56
@Marcello Good answer, but one small thing - are all utf-16 code points really 16-bit? – Armand Feb 21 '17 at 05:34
@Armand AFAIK, yes (I am not an expert in Unicode). But some characters are represented by more than one code point. This is explained clearly in the [String.length's documentation](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/length). – Marcello Nuccio Feb 27 '17 at 14:03
Hmm, weird - I'm still not 100% clear on the difference between characters and code points, but [wikipedia says](https://en.wikipedia.org/wiki/UTF-16): _The encoding is variable-length, as code points are encoded with one or two 16-bit code units._ – Armand Feb 28 '17 at 14:53
@Armand it is quite simple: the chars are the "units of text", that is what humans use to compose a text; the code points are the "units of encoding", that is what computers use to compose a char. Many chars are represented by a single code point, but UTF-16 allows for more than one code point to be used for a single char. – Marcello Nuccio Mar 01 '17 at 10:32
Thanks, Marcello - that's really clear :-) But from wikipedia, it looks like a UTF-16 code point could be two code units, which would make the code point 32 bits. So perhaps your answer needs clarifying slightly where it says "String.length is the number of **16-bit** utf-16 code points"? – Armand Mar 03 '17 at 06:11
@Armand Oh, now I see the error: I didn't notice that I wrote "code points" instead of "code units". The latter are always 16 bits, and String.length returns the number of code *units*. Thanks for the hint. – Marcello Nuccio Mar 03 '17 at 14:32

score 0 · Answer 3 · answered Jun 05 '17 at 01:19

Based on the accepted answer, I suggest the following improvement:

function (doc) {
    emit([JSON.stringify(doc).length, doc._id], doc._id);
}

This has the following advantages:

The doc length as the first key part lets you sort by document size.
The doc ID as the second key part ensures that documents of the same size show up as separate entries.
The doc ID in the value part makes it easier to copy the ID when in Futon (where the key part is rendered as a link).
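
With a `[size, id]` key, a range of sizes can be selected with the view's `startkey` and `endkey` parameters. The sketch below builds such a query string; the function name `sizeRangeQuery` and the design document and view names in the usage comment are made up for illustration. It relies on view collation, where a shorter array sorts before a longer one with the same prefix and objects sort after strings, so `[maxSize, {}]` comes after every `[maxSize, "some-id"]` key.

```javascript
// Hypothetical helper: query string selecting all rows whose size lies
// between minSize and maxSize (inclusive) in a view keyed by [size, id].
function sizeRangeQuery(minSize, maxSize) {
    var startkey = JSON.stringify([minSize]);      // sorts before [minSize, <any id>]
    var endkey = JSON.stringify([maxSize, {}]);    // sorts after [maxSize, <any string id>]
    return "startkey=" + encodeURIComponent(startkey) +
           "&endkey=" + encodeURIComponent(endkey);
}

// Example use with assumed design document and view names:
//   GET /db/_design/sizes/_view/by_size?<query>
var query = sizeRangeQuery(1000, 2000);
```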
