How should I store data in a recommendation engine?

Question

I am developing a recommendation engine. I think I can’t keep the whole similarity matrix in memory. I calculated similarities of 10,000 items and it is over 40 million float numbers. I stored them in a binary file and it becomes 160 MB.

Wow! The problem is that I could have nearly 200,000 items. Even if I cluster them into several groups and created similarity matrix for each group, then I still have to load them into memory at some point. But it will consume a lot memory.

So, is there anyway to deal with these data?

How should I stored them and load into the memory while ensuring my engine respond reasonably fast to an input?

score 2 · Accepted Answer · answered Feb 28 '17 at 22:54

2

You could use memory mapping to access your data. This way you can view your data on disk as one big memory area (and access it just as you would access memory) with the difference that only pages where you read or write data are (temporary) loaded in memory.

If you can group the data somewhat, only smaller portions would have to be read in memory while accessing the data.

As for the floats, if you could do with less resolution and store the values in say 16 bit integers, that would also half the size.

answered Feb 28 '17 at 22:54

Danny_ds

11,201
1
24
46

I am using node.js, does memory mapping available in node.js? – arslan Mar 01 '17 at 05:54
1

@alim I don't know node.js, but it seems possible: http://stackoverflow.com/a/23748621/5708620 - but as also stated in that answer, it would probably be better to access that amount of data from some C++ (or other) code instead of script (and maybe keep some parts in memory for all threads - opening and closing a memory mapping for every request won't help much). – Danny_ds Mar 01 '17 at 10:44

How should I store data in a recommendation engine?

1 Answers1