I have a cluster that uses HDFS as its distributed under storage file system, but I've just read about Alluxio, which is described as fast and flexible. So my question is: should I use Alluxio together with HDFS, or is Alluxio an alternative to HDFS? (Their site says that the shared under storage file system can be a network file system (NFS), so I think HDFS is not required. Correct me if I'm wrong.)

Which setup performs better: HDFS with Alluxio, or Alluxio standalone? (By "standalone" I mean Alluxio used on its own in the cluster, not run locally.)

dtolnay
DAVID_ROA
  • AFAIK, it's an alternative. Similar to MapRFS, perhaps, and competes with IgniteFs – OneCricketeer Aug 30 '18 at 14:09
  • So, if it is an alternative, why does it need a shared under storage system such as HDFS, NFS, S3, etc.? HDFS and the others do not need a shared under storage system; they work with the local file systems of the cluster's machines. – DAVID_ROA Aug 30 '18 at 14:34
  • Similar to how HDFS is an abstraction over the local machines' filesystems, Alluxio is an abstraction over other storage layers such as HDFS. HDFS is not a requirement, though, so in that sense it is an alternative. See https://www.alluxio.org/docs/1.8/en/Alluxio-Storage.html rather than "Under Stores" – OneCricketeer Aug 30 '18 at 18:45

1 Answer

Reply from Alluxio maintainer.

First of all, Alluxio is not a replacement for HDFS. Instead, it is a new abstraction layer on top of other distributed or cloud storage systems, including HDFS, S3, Azure Blob Storage, and others. In your case, if your data is already in HDFS, you will probably still keep HDFS as the persistent data layer underneath Alluxio.

The typical scenarios in which users add Alluxio and see significant benefits include:

  • Your physical data is not co-located with your compute. E.g., your big data engine is reading data from S3 or other object storage. In this case, by deploying Alluxio on the compute nodes, you can make Alluxio work as a filesystem-level cache and avoid fetching the same data across the network repeatedly. See http://www.alluxio.org/overview/remote-data-acceleration
  • You are managing multiple storage systems and want to expose a single data access layer to simplify management. E.g., you can "mount" multiple S3 buckets into one Alluxio deployment so they appear as different directories under the same namespace. See http://www.alluxio.org/overview/storage-unification
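
To make the "mount" idea in the second bullet concrete, here is a minimal Python sketch of a unified namespace. It is a toy model, not Alluxio's implementation or API: the `MountTable` class, the mount points, and the bucket and host names are all hypothetical and exist only for illustration.

```python
from typing import Dict


class MountTable:
    """Toy unified namespace: maps mount points to under-storage URIs.

    Hypothetical illustration only; this is not Alluxio code.
    """

    def __init__(self) -> None:
        self._mounts: Dict[str, str] = {}

    def mount(self, mount_point: str, ufs_uri: str) -> None:
        # Store both sides without a trailing slash so joins stay clean.
        self._mounts[mount_point.rstrip("/")] = ufs_uri.rstrip("/")

    def resolve(self, path: str) -> str:
        # Pick the longest mount point that is a prefix of the path.
        best = None
        for mount_point in self._mounts:
            if path == mount_point or path.startswith(mount_point + "/"):
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            raise KeyError(f"no mount point covers {path}")
        return self._mounts[best] + path[len(best):]


table = MountTable()
# Hypothetical buckets and hosts; two storage systems under one namespace.
table.mount("/sales", "s3://bucket-a/data")
table.mount("/logs", "hdfs://namenode:8020/logs")

print(table.resolve("/sales/2018/q3.parquet"))
print(table.resolve("/logs/app.log"))
```

Clients only ever see paths such as `/sales/...` and `/logs/...`; which storage system actually holds the bytes is decided by the mount table.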

Regarding your original performance question: it depends. If your HDFS is remote from compute, you can expect a good performance gain. I have also seen cases where HDFS is the bottleneck, and Alluxio can help reduce the load and provide a good SLA for certain mission-critical jobs.

apc999
  • So in my case, is there any benefit to using Alluxio on top of HDFS? (Given that I use Spark, which has its own in-memory processing engine, HDFS is not remote, and my data nodes are the same machines as my compute nodes.) – DAVID_ROA Sep 05 '18 at 05:14
  • I don't think this is a target scenario in which Alluxio delivers a significant performance benefit. In addition, it is always worth finding out whether your Spark jobs are I/O intensive or compute intensive. In the latter case, speeding up the I/O portion will hardly help end-to-end performance anyway. – apc999 Sep 06 '18 at 06:44
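
The point in the last comment about I/O-intensive versus compute-intensive jobs can be made concrete with an Amdahl's-law style estimate. This is a rough sketch under stated assumptions: the function name, the I/O fractions, and the 10x I/O speedup are hypothetical numbers chosen for illustration, not measurements of Alluxio or Spark.

```python
def end_to_end_speedup(io_fraction: float, io_speedup: float) -> float:
    """Overall job speedup when only the I/O portion gets faster.

    io_fraction: share of the job's runtime spent on I/O (0.0 to 1.0).
    io_speedup:  factor by which the I/O portion is accelerated.
    """
    if not 0.0 <= io_fraction <= 1.0:
        raise ValueError("io_fraction must be between 0 and 1")
    if io_speedup <= 0:
        raise ValueError("io_speedup must be positive")
    # Compute time is unchanged; only the I/O share shrinks.
    return 1.0 / ((1.0 - io_fraction) + io_fraction / io_speedup)


# Hypothetical jobs, both given a 10x faster I/O path.
compute_bound = end_to_end_speedup(io_fraction=0.1, io_speedup=10.0)
io_bound = end_to_end_speedup(io_fraction=0.8, io_speedup=10.0)

print(f"compute-bound job: {compute_bound:.2f}x")
print(f"I/O-bound job:     {io_bound:.2f}x")
```

With these assumed numbers, the job that spends only 10% of its time on I/O barely improves, while the job that spends 80% of its time on I/O gets several times faster, which is why profiling the job first matters.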