
When I started to use big data technologies, I learned that the fundamental rule is "move the code, not the data". But I realized I don't know how that works: how does Spark know where to move the code?

I'm speaking here about the very first steps, e.g. reading from a distributed file and a couple of map operations.

  1. In the case of an HDFS file, how does Spark know where the actual data blocks are? What is the tool/protocol at work?
  2. Is it different depending on the resource manager (standalone Spark/YARN/Mesos)?
  3. What about applications that store data on top of HDFS, such as HBase/Hive?
  4. What about other distributed storage systems running on the same machines (such as Kafka)?
  5. Apart from Spark, is it the same for similar distributed engines, such as Storm/Flink?

Edit:

For Cassandra + Spark, it seems that the (specialized) connector manages this data locality: https://stackoverflow.com/a/31300118/1206998


1 Answer


1) Spark asks Hadoop how the input file is distributed into splits (another good explanation of splits) and turns the splits into partitions. Check the code of Spark's NewHadoopRDD:

override def getPartitions: Array[Partition] = {
  val inputFormat = inputFormatClass.newInstance
  inputFormat match {
    case configurable: Configurable =>
      configurable.setConf(_conf)
    case _ =>
  }
  val jobContext = newJobContext(_conf, jobId)
  // Ask the Hadoop InputFormat how the input is cut into splits...
  val rawSplits = inputFormat.getSplits(jobContext).toArray
  val result = new Array[Partition](rawSplits.size)
  // ...and wrap each Hadoop split in a Spark partition, one-to-one
  for (i <- 0 until rawSplits.size) {
    result(i) = new NewHadoopPartition(id, i, rawSplits(i).asInstanceOf[InputSplit with Writable])
  }
  result
}
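
The splits are also where the locality lives: the companion getPreferredLocations hook asks each split which hosts hold its data, and the scheduler uses those hostnames as placement hints. Below is a reduced sketch of that hook (the real NewHadoopRDD version additionally handles HDFS in-memory cached replicas):

override def getPreferredLocations(hsplit: Partition): Seq[String] = {
  // Unwrap the Hadoop InputSplit stored inside the Spark partition...
  val split = hsplit.asInstanceOf[NewHadoopPartition].serializableHadoopSplit.value
  // ...and return the hostnames holding its data, as reported by the
  // split itself (for an HDFS file, ultimately by the NameNode)
  split.getLocations.filter(_ != "localhost")
}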

2) No. It does not depend on the resource manager; it depends on the Hadoop InputFormat used to read the file.
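
To make that contract concrete, here is a toy sketch (a hypothetical MySplit class, not anything from Spark or Hadoop) of what every InputFormat's splits must provide: a length, and, crucially for locality, getLocations, the hosts where the data physically sits. HDFS's FileSplit answers getLocations from the NameNode's block metadata; Spark only consumes the hints:

import java.io.{DataInput, DataOutput}
import org.apache.hadoop.io.Writable
import org.apache.hadoop.mapreduce.InputSplit

// Hypothetical split type, for illustration only
class MySplit(var length: Long, var hosts: Array[String])
    extends InputSplit with Writable {

  def this() = this(0L, Array.empty[String]) // needed for Writable deserialization

  override def getLength: Long = length
  override def getLocations: Array[String] = hosts // the locality hint Spark reads

  // Hadoop ships splits to tasks in serialized form
  override def write(out: DataOutput): Unit = {
    out.writeLong(length)
    out.writeInt(hosts.length)
    hosts.foreach(out.writeUTF)
  }

  override def readFields(in: DataInput): Unit = {
    length = in.readLong()
    hosts = Array.fill(in.readInt())(in.readUTF())
  }
}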

3) The same mechanism applies: HBase and Hive expose their data through Hadoop InputFormats as well, as the sketch below shows.
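
For example, reading HBase goes through its TableInputFormat, which produces one split per region with the hosting region server as the location. A minimal sketch, assuming an existing SparkContext sc and a table named "my_table" (both placeholders):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "my_table") // placeholder table name

// Each resulting partition corresponds to an HBase region, and its
// preferred location is the region server, so locality comes for free
val rdd = sc.newAPIHadoopRDD(
  conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])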

4) The mechanism is similar; for example, the KafkaRDD implementation maps Kafka partitions to Spark partitions one-to-one, as sketched below.
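
A condensed, hypothetical sketch of that idea (illustrative names, not the real spark-streaming-kafka classes): each Spark partition wraps exactly one Kafka topic-partition and advertises the leader broker's host as its preferred location, which pays off when executors run on the same machines as the brokers:

import org.apache.spark.Partition

// Illustrative partition type; the real KafkaRDD keeps similar fields
class KafkaLikePartition(
    override val index: Int, // Spark partition index, one per Kafka partition
    val topic: String,
    val kafkaPartition: Int,
    val leaderHost: String   // broker currently leading this topic-partition
) extends Partition

// Inside the RDD, locality falls out of the one-to-one mapping:
//   override def getPreferredLocations(p: Partition): Seq[String] =
//     Seq(p.asInstanceOf[KafkaLikePartition].leaderHost)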

5) I believe they use the same mechanism.

– Vitalii Kotliarenko
  • So I understand that there are *connectors* implemented for each input app/engine (HDFS, Kafka, Cassandra, ...), and these connectors actually map their partitioning system onto RDD partitions. – Juh_ May 27 '16 at 13:42