  • Main question: whether to store a data table as an array of row objects or as an object of column arrays.
  • Proximate question: How to measure the memory footprint of an object.
  • Practical question: How do I read the memory profiler in Chrome?

Background

I'm working with rectangular data tables in JavaScript, both in the browser and in Node.js. Many leading libraries, such as D3 and Crossfilter, store data as arrays of objects, e.g.

var rows = 
  [{name: 'apple', price: 1.79, ...}, 
   {name: 'berry', price: 3.49, ...}, 
   {name: 'cherry', price: 4.29, ...}, ...
  ]

However, with many columns (my use case) and potentially many rows, the overhead of repeating the keys in every row object can become very heavy, and it seems it would be more efficient to store the data (and iterate over it) with each column as an array, as in:

var cols = {
   name: ['apple', 'berry', 'cherry', ...],
   price: [1.79, 3.49, 4.29, ...], 
   ...
}

Profiling question

One answer to the post JavaScript object size describes using the Chrome memory profiler.

I set up the simplistic benchmark below, which can be copied/pasted into the Chrome console and executed. I then looked at the heap snapshot in the Chrome profiler, but I'm not sure how to read it.

At first glance, the retained size seems clearly in favor of columns:

  • window.rowData: 294,170,760 bytes
  • window.colData: 44,575,896 bytes

But if I click on each, they give me the same (huge) retained size:

  • window.rowData: 338,926,668 bytes
  • window.colData: 338,926,668 bytes
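
As a rough cross-check that doesn't require interpreting retained vs. shallow sizes, the JS heap usage can be sampled before and after building each structure. This is only a sketch: performance.memory is a non-standard, Chrome-only API, its numbers are coarse, and garbage-collection timing can skew the delta.

/* Sample Chrome's current JS heap usage (non-standard API; NaN elsewhere). */
function heapUsed() {
  return (performance.memory && performance.memory.usedJSHeapSize) || NaN;
}

var before = heapUsed();
/* ... build window.rowData (or window.colData) here ... */
var after = heapUsed();
console.log('approx. bytes allocated:', after - before);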

Benchmark code

The following code can be copied/pasted into the Chrome console:

/* Return a string of `len` random alphanumeric characters. */
function makeid(len) {
  var text = "";
  var possible = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

  for (var i = 0; i < len; i++)
    text += possible.charAt(Math.floor(Math.random() * possible.length));

  return text;
}


/* create set of 400 string keys with 8 random characters.*/
var keys = window.keys = [], i, c;
for ( i=0;  i < 400; i++) {
  keys.push(makeid(8));
}

/* METHOD 1:  Create array of objects with {colName: cellValue} pairs */
var rows = window.rowData = [];
for ( i = 0; i < 10000; i++) {
  var row = {};
  for ( c = 0; c < 400; c++) {
    row[keys[c]] = Math.random();
  }

  rows.push(row);
}
/* METHOD 2: Create set of columns { colName: [values]} */
var cols = window.colData = {};
for ( c=0; c<400; c++) {
  var col = cols[keys[c]] = [];
  for ( i=0; i<10000; i++) {
    col[i] = rows[i][keys[c]];
  }
}
prototype

1 Answer


I would be very careful about storing data in this fashion.

The main thing that worries me is usability. In my opinion, the biggest drawback of storing data in columns like this is that you become responsible for managing insertion and removal of data in an atomic fashion. You will need to be very careful to ensure that if you remove or insert a value in one column, you also remove or insert a value at the same location in all of the other columns. You'll also have to make sure that whatever is using the data does not read values in the middle of the removal/insertion: if something tries to read a "row" before the update finishes, it will see an inconsistent view, which would be a bad thing. This all sounds very complicated and generally unpleasant to handle in JavaScript to me.
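
To make that bookkeeping concrete, here is a minimal sketch of what every mutation has to do under the column layout (the insertRow/removeRow helpers are hypothetical, not part of the question's code):

/* Hypothetical helpers for the column layout: every column has to be
   touched, and nothing should read the table while a mutation is
   half-applied. */
function insertRow(cols, keys, rowObj, index) {
  for (var c = 0; c < keys.length; c++) {
    var k = keys[c];
    /* Missing values still need a placeholder to keep the columns aligned. */
    cols[k].splice(index, 0, k in rowObj ? rowObj[k] : null);
  }
}

function removeRow(cols, keys, index) {
  for (var c = 0; c < keys.length; c++) {
    cols[keys[c]].splice(index, 1);
  }
}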

When data is stored as objects in an array, you can handle insertion and deletion very simply: remove or add an entire object to the array and you're done. The whole operation is atomic, so you don't have to worry about timing, and you'll never have to worry about forgetting to remove an item from a column.
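
For comparison, the same operations on the array-of-objects layout (using the rows array from the benchmark) are single, inherently atomic calls:

rows.splice(42, 0, {name: 'date', price: 5.99});  /* insert a row */
rows.splice(42, 1);                                /* remove a row */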

As far as memory usage is concerned, it really depends on the actual data you are storing. If you have data like that in your test example, where every "row" has a value in every "column", you will likely save some memory, because the interpreter does not need to store the key names for each value in an object. How this is done is implementation specific, however, and after a little research I couldn't really determine whether this is the case or not. I could easily imagine a clever interpreter using a lookup table to store shared key names, in which case the overhead of storing objects in an array would be almost negligible compared to the column solution.

Also, if your data happens to be sparse, i.e. not every row has a value for every column, you could actually use more memory storing data in columns. In the column scheme you need to store a value in every single column for every row, even if it's null or some other indicator of empty space, to maintain alignment. If you store objects in an array, you can leave off key/value pairs where they aren't needed; if there are a lot of key/value pairs you can leave off, you can save a ton of memory.
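
A small illustration of the sparse case described above (hypothetical data, where some rows have no price):

/* Array of objects: missing fields simply aren't stored. */
var sparseRows = [
  {name: 'apple', price: 1.79},
  {name: 'berry'},                 /* no price key at all */
  {name: 'cherry', price: 4.29}
];

/* Column layout: every column needs a placeholder to keep row 1 aligned. */
var sparseCols = {
  name:  ['apple', 'berry', 'cherry'],
  price: [1.79,    null,    4.29]
};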

As Donald Knuth said, "Premature optimization is the root of all evil." By storing your data in columns like that, you take on a lot of extra work to keep your data consistent (which may lead to fragile code), and you make your code much harder to read, because people won't expect data to be stored like that. You should only inflict these things on yourself if you really, really need to. My recommendation would be to stick with the objects-in-an-array solution, since it makes your code much easier to read and write, and it's pretty unlikely that you actually need the memory the column solution would save. If, down the line, you have performance issues, you can revisit the idea of storing data this way. Even then, I'd be willing to bet that there are other, easier ways of making things run faster.

TwentyMiles
    Excellent points - dynamic updates, sparse arrays, and an optimizing interpreter all favor arrays of objects. Here I have dense, static data, so 94MB vs 294MB is tempting, meaning a 3x larger maximum table size. Is there a way to verify whether that possible saving is real or a mirage? – prototype Feb 25 '14 at 21:28