Questions tagged [distributed-cache]

Use this tag for questions related to DistributedCache, a facility provided by the Map-Reduce framework to cache files (text, archives, jars, etc.) needed by applications.

Applications specify the files to be cached via URLs (hdfs:// or http://) in the JobConf. The DistributedCache assumes that the files specified via URLs are already present on the FileSystem at the path specified by the URL and are accessible by every machine in the cluster.
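
As a minimal submission-side sketch using the older JobConf-based API (the class name, namenode address, and paths below are illustrative placeholders, not taken from any question on this page):

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheSetup {
        // Sketch only: paths and the namenode address are placeholders.
        public static JobConf buildJobConf() throws Exception {
            JobConf conf = new JobConf(CacheSetup.class);

            // The file must already exist on HDFS and be readable by every node in the cluster.
            DistributedCache.addCacheFile(new URI("hdfs://namenode:8020/cache/lookup.txt"), conf);

            // Archives (zip, tar, tgz/tar.gz) are un-archived on the slave nodes after copying.
            DistributedCache.addCacheArchive(new URI("hdfs://namenode:8020/cache/dict.zip"), conf);

            // Jars can optionally be added to the classpath of the tasks.
            DistributedCache.addFileToClassPath(new Path("/cache/udfs.jar"), conf);

            return conf;
        }
    }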

The framework will copy the necessary files onto the slave node before any tasks for the job are executed on that node. Its efficiency stems from the fact that the files are copied only once per job and from its ability to cache archives, which are un-archived on the slaves.

DistributedCache can be used to distribute simple, read-only data/text files and/or more complex types such as archives, jars, etc. Archives (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes. Jars may optionally be added to the classpath of the tasks, providing a rudimentary software distribution mechanism. Files have execution permissions set. In older versions of Hadoop Map/Reduce, users could optionally ask for symlinks to be created in the working directory of the child task; in the current version, symlinks are always created. If the URL does not have a fragment, the name of the file or directory is used as the link name. If multiple files or directories map to the same link name, the last one added is used and the others are not even downloaded.
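
For the task side, here is a hedged sketch of reading a cached file from within a mapper using the older API. The symlink name lookup.txt assumes the file was registered with a URI such as hdfs://namenode:8020/cache/lookup.txt#lookup.txt, which is purely illustrative:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;

    public class CacheAwareMapper extends MapReduceBase {
        @Override
        public void configure(JobConf conf) {
            try {
                // Option 1: resolve the local copies the framework placed on this node.
                Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
                if (localFiles != null && localFiles.length > 0) {
                    BufferedReader byPath = new BufferedReader(new FileReader(localFiles[0].toString()));
                    byPath.close();
                }

                // Option 2: use the symlink created in the task's working directory
                // (named after the URI fragment, e.g. "lookup.txt").
                BufferedReader bySymlink = new BufferedReader(new FileReader("lookup.txt"));
                bySymlink.close();
            } catch (IOException e) {
                throw new RuntimeException("Failed to read cached file", e);
            }
        }
    }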

DistributedCache tracks modification timestamps of the cache files. Clearly the cache files should not be modified by the application or externally while the job is executing.

168 questions
9 votes, 1 answer

Confusion about distributed cache in Hadoop

What does the distributed cache actually mean? Having a file in the distributed cache means that it is available on every datanode and hence there will be no internode communication for that data, or does it mean that the file is in memory in every…
Dhruv Kapur
  • 726
  • 1
  • 8
  • 24
8 votes, 2 answers

Hadoop MapReduce log4j - log messages to a custom file in userlogs/job_ dir?

It's not clear to me how one should configure Hadoop MapReduce log4j at a job level. Can someone help me answer these questions? 1) How to add support for log4j logging from a client machine, i.e., I want to use a log4j property file at the client…
6 votes, 0 answers

Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis

I have run Test-1 and Test-2 below as long-running performance tests with the Redis configuration values specified, but we still see the highlighted error-1 & 2 messages, the cluster fails for some time, and a few of our processes fail. How to solve this…
ravibeli
  • 484
  • 9
  • 30
6 votes, 1 answer

Hazelcast spring configuration

What's the difference between the tag created in the applicationContext vs the one that is defined in the segment? How are they related? I am aware that the one in the applicationContext would result in the creation of a bean of type IMap…
Manish
  • 909
  • 1
  • 11
  • 23
6 votes, 2 answers

Hadoop DistributedCache functionality in Spark

I am looking for functionality similar to the distributed cache of Hadoop in Spark. I need a relatively small data file (with some index values) to be present on all nodes in order to make some calculations. Is there any approach that makes this…
Mikel Urkia
  • 2,087
  • 1
  • 23
  • 40
5 votes, 1 answer

IDistributedCache Removing keys

I've recently started using the SQL version of IDistributedCache on a dotnet core web API. How would you remove/invalidate a set of keys for, say, a specific user? I.e., I structured the keys to follow this…
Pieter
  • 4,721
  • 6
  • 19
  • 18
5 votes, 1 answer

Hazelcast SlowOperationDetector to identify operations with less than 1 sec execution time

I have a performance use case where I need to identify certain process() calls in the EntryProcessor that take more than 300 milliseconds. I tried to make use of the SlowOperationDetector with the following configuration.