Questions tagged [alluxio]

Alluxio is an open source memory-centric distributed file system written in Java. It acts as an in-memory data caching layer between applications and data storage systems. The software is published under the Apache License.

Alluxio (formerly Tachyon) is an open source memory-speed distributed file system. It is a data layer between compute and storage, abstracting the files or objects in underlying persistent storage systems and providing a shared data access layer for compute applications. Alluxio was developed at the University of California, Berkeley's AMPLab.

Alluxio can be used as a distributed shared caching service for big data analytics frameworks, so that compute applications talking to Alluxio can transparently cache frequently accessed data, especially data from remote locations, and read it at in-memory I/O throughput.
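The caching idea above can be pictured with a small read-through cache. This is only a conceptual sketch in Python, not Alluxio's API: the class `ReadThroughCache`, the `remote_store` dictionary, and the least-recently-used eviction policy are all assumptions made for illustration.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Toy in-memory cache in front of a slow 'remote' store.

    Hypothetical illustration only; this is not how Alluxio is implemented.
    """

    def __init__(self, remote_read, capacity=2):
        self._remote_read = remote_read   # callable: path -> bytes
        self._capacity = capacity         # maximum number of cached entries
        self._entries = OrderedDict()     # path -> bytes, least recently used first
        self.remote_reads = 0             # how many reads went to the remote store

    def read(self, path):
        if path in self._entries:
            # Cache hit: serve from memory and mark as most recently used.
            self._entries.move_to_end(path)
            return self._entries[path]
        # Cache miss: fetch from the remote store, then keep a copy in memory.
        data = self._remote_read(path)
        self.remote_reads += 1
        self._entries[path] = data
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return data


# Stand-in for remote storage (assumed data, for the example only).
remote_store = {"/data/a": b"alpha", "/data/b": b"beta", "/data/c": b"gamma"}
cache = ReadThroughCache(remote_store.__getitem__, capacity=2)

cache.read("/data/a")  # miss: goes to the remote store
cache.read("/data/a")  # hit: served from memory
cache.read("/data/b")  # miss: goes to the remote store
```

The application only ever calls `read`; whether the bytes came from memory or from the remote store is invisible to it, which is the sense of "transparently" in the description above.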

Alluxio can also simplify cloud and object storage adoption. Cloud and object storage systems use different semantics that have performance implications compared to traditional file systems. For example, when accessing data in cloud storage there is no node-level locality or cross-application caching. Common file system operations such as directory listing ('ls') and 'rename' also have different performance characteristics, which often adds significant overhead to analytics. Deploying Alluxio with cloud or object storage can close the semantics gap and achieve significant performance gains.
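To see why 'rename' can be expensive on object storage, consider a flat key-value store in which "directories" are only key prefixes. The sketch below is a simplified assumption about how many object stores behave (rename as copy plus delete per object); the function `rename_prefix` and the `bucket` contents are hypothetical.

```python
def rename_prefix(objects, old_prefix, new_prefix):
    """Rename a 'directory' in a flat object store by copying and deleting each key.

    `objects` is a dict standing in for a bucket (key -> bytes).
    Returns the number of per-object operations performed.
    """
    operations = 0
    # Snapshot the matching keys first so the dict is not mutated while iterating.
    matching = [key for key in objects if key.startswith(old_prefix)]
    for key in matching:
        new_key = new_prefix + key[len(old_prefix):]
        objects[new_key] = objects[key]  # copy the object under the new key
        del objects[key]                 # delete the original object
        operations += 2
    return operations


bucket = {
    "logs/2020/a": b"1",
    "logs/2020/b": b"2",
    "other/x": b"3",
}
ops = rename_prefix(bucket, "logs/2020/", "logs/archive/")
```

The cost grows with the number of objects under the prefix, whereas a traditional file system can usually rename a directory as a single metadata update.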

Alluxio is written in Java and hosted on GitHub.

90 questions
70
votes
2 answers

Errors when using OFF_HEAP Storage with Spark 1.4.0 and Tachyon 0.6.4

I am trying to persist my RDD using off-heap storage on Spark 1.4.0 and Tachyon 0.6.4, like this: val a = sqlContext.parquetFile("a1.parquet") a.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP) a.count() Afterwards I am getting the…
qwertz1123
  • 1,173
  • 10
  • 27
8
votes
1 answer

Is Tachyon by default implemented by the RDDs in Apache Spark?

I'm trying to understand Spark's in-memory feature. In the process I came across Tachyon, which is basically an in-memory data layer that provides fault tolerance without replication by using lineage systems and reduces re-computation by…
7
votes
1 answer

Resources/Documentation on how the failover process works for the Spark Driver (and its YARN Container) in yarn-cluster mode

I'm trying to understand whether the Spark Driver is a single point of failure when deploying in cluster mode on YARN. So I'd like to get a better grasp of the innards of the failover process regarding the YARN Container of the Spark Driver in this…
MiguelPeralvo
  • 837
  • 1
  • 11
  • 19
5
votes
1 answer

Spark Tachyon: How to delete a file?

In Scala, as an experiment I create a sequence file on Tachyon using Spark and read it back in. I want to delete the file from Tachyon using the Spark script also. val rdd = sc.parallelize(Array(("a",2), ("b",3),…
bjjer
  • 983
  • 1
  • 7
  • 7
4
votes
0 answers

Spark concurrency performance issue Vs Presto

We are benchmarking Spark with Alluxio and Presto with Alluxio. To evaluate performance we took 5 different queries (with some joins, group by and sort) and ran them on a 650 GB dataset in ORC. The Spark execution environment is set up in such a…
Rijo Joseph
  • 1,375
  • 3
  • 17
  • 33
4
votes
1 answer

Alluxio Error: java.lang.IllegalArgumentException: Wrong FS

I am able to run wordcount on alluxio with an example jar provided by cloudera, using: sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar wordcount -libjars…
Sambhu R
  • 43
  • 4
4
votes
1 answer

What's the difference between Apache Ignite and Tachyon?

I am new to Apache Ignite. For the Ignite and Spark integration, it looks like Ignite provides an in-memory layer where data lives across Spark applications, which is the capability that Tachyon provides as an in-memory file system. So, my…
Tom
  • 5,848
  • 12
  • 44
  • 104
4
votes
1 answer

How to use Tachyon to share data between Spark jobs

I'm a beginner with Tachyon. I want to share some data or an RDD between Spark jobs. The Tachyon overview says Tachyon is an open source memory-centric distributed storage system enabling reliable data sharing at memory-speed across cluster jobs. But I…
4
votes
2 answers

What is the difference between distributed cache and Tachyon?

Distributed cache is a method that stores common requests and enables quick retrieval. Tachyon is a memory-centric distributed storage file system that avoids going to disk to load datasets that are frequently read. What is the difference between…
3
votes
2 answers

How to convert spark RDD to mahout DRM?

I am fetching data from Alluxio in Mahout using sc.textFile(), but it returns a Spark RDD. The rest of my program uses this data as a Mahout DRM, so I need to convert the RDD to a DRM so that my current code keeps working.
2
votes
1 answer

Why do mtime and atime need to be updated?

Does anyone know why the mtime and atime need to be updated when completing the file? mInodeTree.updateInode(rpcContext, UpdateInodeEntry.newBuilder() .setId(inode.getId()) .setUfsFingerprint(ufsFingerprint) …
ChanChan Mao
  • 157
  • 8
2
votes
1 answer

The difference between invoking Maven directly in the shell and invoking it from IntelliJ IDEA

Edit 3: I also tried to set the Maven proxy through the Java option parameters mentioned in this thread. Edit 2: I'm sure IntelliJ IDEA is using the same settings.xml, the same Maven binary and the same local repository as the system Maven. Edit 1: I tried to…
Eugene
  • 10,627
  • 5
  • 49
  • 67
2
votes
1 answer

Hive: modifying an external table's location takes too long

Hive has two kinds of tables, managed and external; for the difference, see Managed vs. External Tables. Currently, to move an external database from HDFS to Alluxio, I need to modify the external table's location to…
Eugene
  • 10,627
  • 5
  • 49
  • 67
2
votes
1 answer

Spark job failed to write to Alluxio due to DeadlineExceededException

I am running a Spark job that writes to an Alluxio cluster with 20 workers (Alluxio 1.6.1). The Spark job failed to write its output due to alluxio.exception.status.DeadlineExceededException. The worker still appears alive in the Alluxio web UI. How can I avoid…
apc999
  • 250
  • 3
  • 6
2
votes
1 answer

Alluxio with/without HDFS

I have a cluster with HDFS as the under storage distributed file system, but I've just read about Alluxio, which is fast and flexible. So my question is: should I use Alluxio with HDFS, or is Alluxio an alternative to HDFS? (I see on their site that…
DAVID_ROA
  • 309
  • 1
  • 3
  • 18