I'm experimenting with using CouchDB as a message store and would like to report the size of each message.

Ideally it would be nice to read a _size attribute. At worst I could check the string length of the entire document's JSON. I may even want to use the size as a view key.

What do you think is the best way to record document size and why do you think that method is best?

Sir Wobin

3 Answers

You could make a view that emits the length of each document's JSON string:

function (doc) {
    emit(doc._id, JSON.stringify(doc).length);
}
Robert Newson

You can make a HEAD request:

$ curl -I http://USER:PASS@localhost:5984/db/doc_id
HTTP/1.1 200 OK
Server: CouchDB/1.1.1 (Erlang OTP/R14B03)
Etag: "1-c0b6a87a64fa1b1f63ee2aa7828a5390"
Date: Tue, 17 Jan 2012 21:32:43 GMT
Content-Type: text/plain;charset=utf-8
Content-Length: 740047
Cache-Control: must-revalidate

The Content-Length header contains the length of the document in bytes. This is very fast because the server returns only the headers, so you don't need to download the full document.

But there's a caveat: Content-Length is the number of bytes in the UTF-8 encoding of the document (see the Content-Type header), whereas String.length is the number of 16-bit UTF-16 code units in a string.

That is, the two count different things (bytes versus code units) in different encodings of the document (UTF-8 versus UTF-16).
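
To make the difference concrete, here is a minimal sketch of a helper that counts UTF-8 bytes using only charCodeAt, so it does not depend on Buffer or TextEncoder. The name utf8ByteLength is hypothetical (it is not part of CouchDB), and the result for a stringified document is not guaranteed to match the Content-Length header exactly, since the server may serialize the document differently than JSON.stringify does.

```javascript
// Hypothetical helper, not a CouchDB API: counts the bytes a string
// would occupy when encoded as UTF-8, walking its UTF-16 code units.
function utf8ByteLength(str) {
  var bytes = 0;
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    if (unit < 0x80) {
      bytes += 1;                       // ASCII
    } else if (unit < 0x800) {
      bytes += 2;                       // two-byte sequence
    } else if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        bytes += 4;                     // surrogate pair: one code point, four bytes
        i++;                            // skip the low surrogate
      } else {
        bytes += 3;                     // lone high surrogate, counted as U+FFFD
      }
    } else {
      bytes += 3;                       // rest of the BMP, including lone surrogates
    }
  }
  return bytes;
}

var samples = ['hello', 'h\u00e9llo', '\u20ac', '\ud83d\ude00'];
samples.forEach(function (s) {
  console.log(JSON.stringify(s), 'length =', s.length, 'utf8 bytes =', utf8ByteLength(s));
});
```

For plain ASCII the two numbers agree; they diverge as soon as the text contains accented letters, symbols such as the euro sign, or characters outside the Basic Multilingual Plane.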

Marcello Nuccio
  • This is also a good answer. Since I want to be able to use the size in an index, I prefer the other answer this time. – Sir Wobin Jan 23 '12 at 14:56
  • @Marcello Good answer, but one small thing - are all utf-16 code points really 16-bit? – Armand Feb 21 '17 at 05:34
  • @Armand AFAIK, yes (I am not an expert in Unicode). But some characters are represented by more than one code point. This is explained clearly in the [String.length's documentation](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/length). – Marcello Nuccio Feb 27 '17 at 14:03
  • Hmm weird - I'm still not 100% clear the difference between characters and code points, but [wikipedia says](https://en.wikipedia.org/wiki/UTF-16): _The encoding is variable-length, as code points are encoded with one or two 16-bit code units._ – Armand Feb 28 '17 at 14:53
  • @Armand it is quite simple: the chars are the "units of text", that is what humans use to compose a text; the code points are the "units of encoding", that is what computers use to compose a char. Many chars are represented by a single code point, but UTF-16 allows for more than one code point to be used for a single char. – Marcello Nuccio Mar 01 '17 at 10:32
  • Thanks, Marcello - that's really clear :-) But from wikipedia, it looks like a UTF-16 code point could be two code units, which would make the code point 32 bits. So perhaps your answer needs clarifying slightly where it says "String.length is the number of **16-bit** utf-16 code points"? – Armand Mar 03 '17 at 06:11
  • @Armand Oh, now I see the error: I didn't notice that I wrote "code points" instead of "code units". The latter are always 16 bits, and String.length returns the number of code *units*. Thanks for the hint. – Marcello Nuccio Mar 03 '17 at 14:32

Based on the accepted answer, I suggest the following improvement:

function (doc) {
    emit([JSON.stringify(doc).length, doc._id], doc._id);
}

This has the following advantages:

  • doc length as the first key part lets you sort by document size.

  • doc id as second key part ensures that documents with the same size show up as separate entries.

  • doc id in the value part makes it easier to copy the ID when in Futon (where the key part is shown as a link to the document).
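
With the size as the first key part, a range of sizes can be selected with the standard startkey and endkey view parameters. The following is a minimal sketch of building such a query string; the helper name sizeRangeQuery is hypothetical, and so are the design document and view names in the usage comment. It relies on CouchDB's view collation, where a shorter array sorts before a longer one with the same prefix and an object sorts after any string.

```javascript
// Hypothetical helper: builds the query string for fetching all rows
// whose first key part (the document size) lies in [minSize, maxSize].
function sizeRangeQuery(minSize, maxSize) {
  var params = {
    // [minSize] sorts before every [minSize, "<doc id>"] key
    startkey: JSON.stringify([minSize]),
    // {} sorts after any string, so [maxSize, {}] follows every [maxSize, "<doc id>"]
    endkey: JSON.stringify([maxSize, {}])
  };
  return Object.keys(params).map(function (name) {
    return name + '=' + encodeURIComponent(params[name]);
  }).join('&');
}

// Assumed names: design document "sizes" and view "by_size" are placeholders.
console.log('/db/_design/sizes/_view/by_size?' + sizeRangeQuery(1000, 2000));
```

Adding descending=true (and swapping startkey and endkey) would list the largest documents first.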

Michel Müller