
I want to integrate data deduplication into software I am writing to back up VMware images. I haven't been able to find anything suitable for what I think I need. There seem to be a LOT of complete solutions that include one form of deduplication or another: storage or backup solutions that use public or private clouds, specialized file systems, storage networks or devices, and so on. However, I need to develop my own solution and integrate dedupe into it. My software will be written in C#, and I would like to be able to call an API to tell it what to dedupe.

The type of deduplication I am talking about is not deduping one image against another image (the typical approach for producing incremental or differential backups of two "versions" of something, called "client backup deduplication" in the Wikipedia entry on data deduplication). I already have a solution for that and want to take things a step further.

I envisage an approach that would allow me to dedupe chunks of data somehow on a global level (i.e. some form of global deduplication). To be global, I imagine there would be a central lookup table of some sort (e.g. an index of hashes) that would tell the deduper that a copy of the data being examined is already held and does not need to be stored again. The chunks could be deduplicated at file level (Single Instance Storage, or SIS) or at sub-file/block level. The latter should be more efficient (which matters more for our purposes than, say, processing overhead) and would be my preferred option, but I could make SIS work too if I had to.
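To make the idea concrete, here is a minimal sketch in C# of such a central hash index: fixed-size chunks keyed by SHA-256, with only previously unseen chunks actually stored. This is an illustration of the concept, not any existing product's API; the class and chunk size are hypothetical, and real block-level dedupe would normally use variable-length chunking.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

// Sketch of a global chunk index keyed by SHA-256.
// Fixed-size chunks for simplicity; all names are illustrative.
class ChunkIndex
{
    const int ChunkSize = 4096;  // illustrative chunk size
    readonly Dictionary<string, byte[]> store = new Dictionary<string, byte[]>();

    // Returns the "recipe" of chunk hashes describing the stream;
    // only chunks not already in the index are stored.
    public List<string> Deduplicate(Stream input)
    {
        var recipe = new List<string>();
        var buffer = new byte[ChunkSize];
        using (var sha = SHA256.Create())
        {
            int read;
            while ((read = input.Read(buffer, 0, ChunkSize)) > 0)
            {
                var chunk = new byte[read];
                Array.Copy(buffer, chunk, read);
                string key = Convert.ToBase64String(sha.ComputeHash(chunk));
                if (!store.ContainsKey(key))   // new data: store it once
                    store[key] = chunk;
                recipe.Add(key);               // always record the reference
            }
        }
        return recipe;
    }

    public int UniqueChunks => store.Count;
}
```

Restoring an image would then just mean replaying a recipe against the store, and the store itself is what would need to be made "global" (shared across all backed-up images).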

I have now done a lot of reading about other people's software that does deduping, as I mentioned above. I won't cite examples here because I am not trying to emulate anyone else's approach specifically. Rather, I haven't been able to find a programmer's solution and want to know if there's anything like that available. The alternative would be to roll my own solution, but that would be a pretty big task, to put it mildly.


Leaurus
stifin
  • Look into how rsync does stuff. It might give you some inspiring thoughts. Or maybe you already did that... – Daniel Mošmondor Nov 16 '11 at 15:30
  • thanks, rsync's approach of using a sliding window of checksums to determine whether there is a match with the file it syncs to (as I understand it from my reading of rsync) corresponds to the part I mentioned about block-level dedupe. However, a typical VMware image of a server contains umpteen standard data items (e.g. OS and program files) that could be deduped out. Unless I somehow sync/compare against a massive central library of these files, I cannot dedupe them out in a practical manner. – stifin Nov 16 '11 at 17:05
  • You can hash each file and store the hash along with the file. However, then you'll have a 'file' vs 'image' conundrum. Anyway, you'll have to be more concrete with your questions here... – Daniel Mošmondor Nov 16 '11 at 21:36
  • I've been as specific as I can while trying not to limit options for what might be out there: I need an API that supports global deduping. Using rsync seems to mean rolling my own solution, and as I said in my original question, I already have a tool that can dedupe one file against another. "Rolling my own" means building my own hash index for all the Windows, SQL, Exchange, IIS, etc. files and making it global. I find that if I describe the scenario, not just what I think I need, people with experience in the area often point me in a better direction when they see the context. – stifin Nov 17 '11 at 11:07
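For reference, the rsync-style sliding window of checksums mentioned in the comments relies on a weak *rolling* checksum, which can advance one byte at a time in O(1) instead of rehashing the whole window. A minimal sketch (modelled on rsync's Adler-32-like checksum, but not rsync's actual code; names and parameters are illustrative):

```csharp
// Sketch of an rsync-style weak rolling checksum.
// Rolling lets a sliding window advance one byte in O(1)
// instead of recomputing the checksum over the whole window.
class RollingChecksum
{
    const int Mod = 1 << 16;
    int a, b, window;

    // Compute the checksum of data[offset .. offset+length-1] from scratch.
    public void Init(byte[] data, int offset, int length)
    {
        a = 0; b = 0; window = length;
        for (int i = 0; i < length; i++)
        {
            a = (a + data[offset + i]) % Mod;
            b = (b + (length - i) * data[offset + i]) % Mod;
        }
    }

    // Slide the window one byte: drop outByte, add inByte.
    public void Roll(byte outByte, byte inByte)
    {
        a = (a - outByte + inByte + 2 * Mod) % Mod;            // keep non-negative
        b = (b - window * outByte + a + Mod * window) % Mod;   // uses the new a
    }

    public uint Value => (uint)((b << 16) | a);
}
```

In rsync the cheap weak checksum is used to find candidate matches in a hash table, and only candidates are confirmed with a strong hash; the same two-level trick applies to a dedupe index.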

1 Answer


Global deduplication as you describe it is typically handled outside of most virtual machine backup programs, because VMware's Changed Block Tracking (CBT) already tells you which blocks changed in a VM, so you don't have to take a full backup every time. Global dedupe also tends to be resource intensive, so most folks would just get a Data Domain appliance instead and take advantage of hardware (SSDs) and software (custom filesystems, variable-length dedupe) that are dedicated, configured and optimized for deduping. Conceivably, the backup program you are creating could take advantage of both CBT and Data Domain's offerings, in the way some commercially available backup software (such as Veeam) already does. Data Domain's dedupe strategy is based on variable-length segments.
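The variable-length segmentation mentioned above can be approximated with content-defined chunking: a rolling hash over the data declares a chunk boundary wherever the hash matches a bit mask, so an insertion shifts boundaries only locally rather than invalidating every fixed-size block after it. A minimal sketch (a gear-style hash as used by FastCDC-like schemes; the mask, minimum size, and table seed are all illustrative assumptions, not Data Domain's actual algorithm):

```csharp
using System;
using System.Collections.Generic;

// Sketch of content-defined (variable-length) chunking: boundaries are
// chosen where a rolling hash matches a mask, so chunk edges depend on
// content rather than on fixed offsets.
static class Cdc
{
    const uint Mask = (1u << 12) - 1;  // ~4 KiB average chunk size
    const int MinChunk = 256;          // avoid degenerate tiny chunks
    static readonly uint[] Gear = BuildGear();

    static uint[] BuildGear()
    {
        var rnd = new Random(42);      // fixed seed: deterministic table
        var g = new uint[256];
        for (int i = 0; i < 256; i++)
            g[i] = (uint)rnd.Next();
        return g;
    }

    public static List<byte[]> Split(byte[] data)
    {
        var chunks = new List<byte[]>();
        int start = 0;
        uint h = 0;
        for (int i = 0; i < data.Length; i++)
        {
            h = (h << 1) + Gear[data[i]];  // gear-style rolling hash
            if (i - start + 1 >= MinChunk && (h & Mask) == 0)
            {
                chunks.Add(Slice(data, start, i + 1));  // boundary found
                start = i + 1;
                h = 0;
            }
        }
        if (start < data.Length)
            chunks.Add(Slice(data, start, data.Length)); // trailing chunk
        return chunks;
    }

    static byte[] Slice(byte[] data, int from, int to)
    {
        var s = new byte[to - from];
        Array.Copy(data, from, s, 0, s.Length);
        return s;
    }
}
```

Each resulting chunk would then be hashed and looked up in the global index, which is essentially the pipeline the question describes.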


borgified