OpenStack (Swift) or CEPH deduplication feature? or any deduplication HA storage cluster solutions?

Question

For an ownCloud (or Nextcloud) project we need to add a large amount of storage. I've been checking options such as Ceph, OpenStack Swift/Cinder, GlusterFS, SDFS, and Tahoe-LAFS.

We expect users of this service to upload many identical files, which is why deduplication is quite important for us. So far, the only solutions I have found that deduplicate clustered storage data are SDFS and Tahoe-LAFS. Our concern is that these two are written in Java and Python respectively and will hurt CPU too much. (Yes, deduplication will likely mean more RAM and CPU use as well.)

Perhaps one of you has a better solution? Note that a deduplicating filesystem (e.g. ZFS) will not work, because the data is stored on multiple machines (HA cluster).

As far as I know, the current version of OpenStack Swift (2.13.0) has no deduplication feature. — Nelson Marcos, May 30 '17 at 12:20

score 1 · Answer 1 · answered Aug 30 '17 at 06:43

This is not the complete solution I think you are looking for, but it is an open source deduplication library for Node.js, with a native binding written in C++ and a reference implementation written in JavaScript:

https://github.com/ronomon/deduplication

It should be fast enough, provided you implement the indexing yourself, for example with a KV store backed by an LSM-tree.
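
To illustrate what "implementing the indexing yourself" might involve, here is a rough Python sketch of chunk-level deduplication. It is not the API of the library linked above. The chunker is a simple Gear-style rolling hash chosen for illustration, the size constants are arbitrary assumptions, and a plain dict stands in for the LSM-tree backed KV store. The names `chunk` and `DedupStore` are hypothetical.

```python
import hashlib
import random

MIN_CHUNK = 64      # assumed minimum chunk size in bytes (illustrative)
MAX_CHUNK = 1024    # assumed maximum chunk size in bytes (illustrative)
MASK = 0xFF         # cut when the low 8 bits of the rolling hash are zero

# Per-byte random values for a Gear-style rolling hash.
# A fixed seed keeps chunk boundaries deterministic across runs.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def chunk(data: bytes):
    """Yield content-defined chunks: boundaries depend on the bytes themselves."""
    start = 0
    h = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield data[start:i + 1]
            start = i + 1
            h = 0
    if start < len(data):
        yield data[start:]  # trailing partial chunk


class DedupStore:
    """Toy chunk store; the dict is a stand-in for a persistent KV store."""

    def __init__(self):
        self.index = {}  # SHA-256 digest -> chunk bytes

    def put(self, data: bytes):
        """Store data and return its recipe (ordered list of chunk digests)."""
        recipe = []
        for piece in chunk(data):
            key = hashlib.sha256(piece).digest()
            if key not in self.index:  # only previously unseen chunks use space
                self.index[key] = piece
            recipe.append(key)
        return recipe

    def get(self, recipe):
        """Reassemble the original data from a recipe."""
        return b"".join(self.index[key] for key in recipe)

    def stored_bytes(self):
        """Total bytes actually held in the chunk index."""
        return sum(len(piece) for piece in self.index.values())
```

Uploading the same file twice through `put` yields the same recipe and adds no new chunks to the index. In a clustered setup the index lookup is the hot path, which is why the choice of KV store matters for throughput.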
