
Using HDFS as storage for NuoDB: would this have a performance impact?

If I understand correctly, HDFS is better suited for batch-mode, or write-once-read-many, types of applications. Wouldn't it increase the latency for a record to be fetched in case it needs to be read from storage?

On top of that, with the HDFS block size concept, keeping file sizes small would increase the network traffic while data is being fetched. Am I missing something here? If so, please point it out.

How would NuoDB manage these kinds of latency gotchas?

  • Hi... Any takers for this question? I already posted a similar query in the NuoDB forum too... Appreciate a response... http://www.nuodb.com/community/showthread.php?186-NuoDB-HDFS&p=1066#post1066 – user1687711 Jan 29 '13 at 16:39

1 Answer


Good afternoon,

My name is Elisabete and I am the Technical Support Engineer over at NuoDB. I believe that I may have just answered this via your post on our own forum, but I'm responding here as well for anyone else who's curious.

First... a mini lesson on NuoDB architecture/layout:

The most basic NuoDB set-up includes:

  • Broker Agent
  • Transaction Engine (TE)
  • Storage Manager (SM) connected to an Archive Directory

Broker Agents keep track of all the moving parts in the domain (collection of machines hosting NuoDB processes) and provide client applications with connection information for the next available Transaction Engine.

Transaction Engines process incoming SQL requests and manage transactions.

Storage Managers read and write data to and from "disk" (the Archive Directory).
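
To make the client's view of this concrete, here's a minimal sketch of connecting through a Broker, assuming the standard NuoDB JDBC driver is on the classpath. The broker host, credentials, and database name below are all placeholders (48004 is the usual default agent/broker port):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.Properties;

    public class BrokerConnectionExample {
        public static void main(String[] args) throws Exception {
            // The application only ever talks to a Broker; the Broker hands
            // back connection details for an available Transaction Engine.
            Properties props = new Properties();
            props.put("user", "dba");         // placeholder credentials
            props.put("password", "secret");

            // "broker-host" and the database name "test" are placeholders.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:com.nuodb://broker-host:48004/test", props);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT 1 FROM DUAL")) {
                while (rs.next()) {
                    System.out.println(rs.getInt(1)); // prints 1 if the round trip worked
                }
            }
        }
    }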

All of these components can reside on a single machine, but an optimal setup would have them spread across multiple host machines (allowing each process to take full advantage of the host's available CPU/RAM). Also, while it's possible to run with just one of each component, this is a case where more is definitely more. Additional Brokers provide resiliency, additional TEs increase performance/speed, and additional SMs ensure durability.

Ok, so now let's talk about Storage:

This is the "Archive Directory" that your storage manager is writing to. Currently, we support three modes of storage:

  • Local File System
  • Amazon Web Services: Simple Storage Service (S3), Elastic Block Store (EBS)
  • Hadoop Distributed File System (HDFS)

So, to elaborate on how NuoDB works with HDFS... it doesn't know about the multiple machines that the HDFS layer is writing to. As far as the SM is concerned, it is reading and writing data atoms to a single directory. The HDFS layer decides how to then distribute and retrieve data to and from the cluster of machines it resides over.
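
To see that "single directory" illusion in isolation, here's a minimal sketch against the plain Hadoop client API. This is not NuoDB code; the NameNode address and file path are placeholders, and the point is simply that the caller writes to one path while HDFS handles the block distribution underneath:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SingleDirectoryIllusion {
        public static void main(String[] args) throws Exception {
            // From the caller's perspective this is one file in one
            // directory. Underneath, HDFS splits the file into blocks and
            // replicates them across the cluster's DataNodes.
            Configuration conf = new Configuration();
            try (FileSystem fs = FileSystem.get(
                    URI.create("hdfs://namenode-host:8020"), conf)) {
                Path file = new Path("/nuodb/archive/example.dat"); // illustrative path
                try (FSDataOutputStream out = fs.create(file)) {
                    out.writeBytes("payload that HDFS will block-split and replicate");
                }
            }
        }
    }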

And now to finally address the question of latency:

Here's the thing: whenever we introduce a remote storage device, we inevitably introduce some amount of additional latency, because the SM now has further to go when reading/writing atoms to/from memory. HDFS likely adds a bit more, because now it needs to do its magic divvying up, distributing, retrieving and reassembling data. Add to that discrepancies in network speed, etc.

I imagine that the gained disk space outweighs the cost in travel time, but this is something you'd have to decide on a case by case basis.

Now, all of that said... I haven't mentioned that TEs and SMs both have the ability to cache data in local memory. The size of this cache is something you can set when starting up each process. NuoDB uses a combination of Multi-Version Concurrency Control (MVCC) and a near-constant stream of communication between all of the processes to ensure that data held in cache is kept up to date with all of the changes happening within the system. Garbage collection also kicks in and clears out atoms in Least Recently Used (LRU) order when the cache grows close to hitting its limit.
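
For a concrete picture of that eviction policy, here's a toy least-recently-used cache in Java. It illustrates the LRU behavior only; it is not NuoDB's actual cache implementation:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // A toy LRU cache: evicts the least recently used entry once full.
    public class LruCache<K, V> extends LinkedHashMap<K, V> {
        private final int maxEntries;

        public LruCache(int maxEntries) {
            // accessOrder = true keeps iteration order from least to most
            // recently accessed, which is exactly what LRU eviction needs.
            super(16, 0.75f, true);
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            // Called after each insert; returning true drops the eldest
            // (least recently used) entry when the size limit is exceeded.
            return size() > maxEntries;
        }
    }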

All of this helps reduce latency, because the TEs can hold onto the data they reference most often and grab copies of data they don't have from sibling TEs. When they do resort to asking the SMs for data, there's a chance that the SM (or one of its sibling SMs) has a copy of the requested data in local cache, saving itself the trip out to the Archive Directory.
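
Sketching that lookup order as code may help; note that every type and method name below is invented for illustration, and none of this is NuoDB's actual API:

    import java.util.Optional;

    // Hypothetical: a place an atom might be found.
    interface AtomSource {
        Optional<byte[]> find(long atomId);
    }

    class ReadPath {
        private final AtomSource localCache;   // this TE's own cache
        private final AtomSource siblingTEs;   // peer Transaction Engines
        private final AtomSource storageCache; // an SM's in-memory cache
        private final AtomSource archive;      // the Archive Directory on "disk"

        ReadPath(AtomSource localCache, AtomSource siblingTEs,
                 AtomSource storageCache, AtomSource archive) {
            this.localCache = localCache;
            this.siblingTEs = siblingTEs;
            this.storageCache = storageCache;
            this.archive = archive;
        }

        // Try the cheapest source first; only the last resort touches storage.
        byte[] read(long atomId) {
            return localCache.find(atomId)
                    .or(() -> siblingTEs.find(atomId))
                    .or(() -> storageCache.find(atomId))
                    .or(() -> archive.find(atomId))
                    .orElseThrow(() -> new IllegalStateException("atom not found"));
        }
    }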

Whew... that was a lot, and I absolutely glossed over more than a few concepts. These topics are covered in greater depth in the new suite of white papers (and the new "green book") available on our main website. I'm currently also working on some visual guides to help explain all of this.

If you'd like to know more about NuoDB, or if I didn't quite answer your question... please reach out to me directly via the NuoDB Community Forums (I respond to posts there a bit faster).

Thank you, Elisabete

Technical Support Engineer at NuoDB

  • have you guys considered supporting Azure Table Storage as another archive directory that the SM can write to? – Mark13426 May 27 '14 at 13:31
  • Please suggest how to set up only the storage manager with Amazon Web Services Simple Storage Service (S3)? – Yellappa Oct 31 '17 at 07:28
  • Please provide any process steps and documentation for setting up the storage manager. – Yellappa Oct 31 '17 at 07:28
  • I installed the NuoDB 3.0 Linux version on one machine; while creating a database, is it possible to give an archive directory location in Amazon S3 bucket storage? Here the SM is running on one machine and the archive location points to Amazon S3 buckets (the storage area), i.e. the archive directory is on a different machine. If this is possible, please share information on how to follow the process. – Yellappa Nov 06 '17 at 14:15
  • I have an SM running on a CentOS machine and have another file system location (Ceph) with the S3 API configured. Do you have an idea how to configure or locate the S3 API inside the SM configuration? How do we pass on the security info so that S3 allows the request? I used the command below, but it didn't work out and I didn't receive any error message either: start process sm archive myBucket.storage.myDomain.io host localhost database myDatabase initialize true – Yellappa Nov 06 '17 at 14:17