
When I started to use big data technologies, I learned that the fundamental rule is "move the code, not the data". But I realized I don't know how that works: how does Spark know where to move the code?

I'm speaking here about the very first steps, e.g. reading from a distributed file and a couple of map operations.

  1. In the case of an HDFS file, how does Spark know where the actual data blocks are? What is the tool/protocol at work?
  2. Is it different depending on the resource manager (standalone Spark/YARN/Mesos)?
  3. What about applications that store data on top of HDFS, such as HBase/Hive?
  4. What about other distributed storage systems running on the same machines (such as Kafka)?
  5. Apart from Spark, is it the same for similar distributed engines, such as Storm/Flink?

Edit:

For Cassandra + Spark, it seems that the (specialized) connector manages this data locality: https://stackoverflow.com/a/31300118/1206998


1 Answer


1) Spark asks Hadoop how the input file is distributed into splits (another good explanation of splits) and turns the splits into partitions. Check the code of Spark's NewHadoopRDD:

override def getPartitions: Array[Partition] = {
  val inputFormat = inputFormatClass.newInstance
  inputFormat match {
    case configurable: Configurable =>
      configurable.setConf(_conf)
    case _ =>
  }
  val jobContext = newJobContext(_conf, jobId)
  // Ask the Hadoop InputFormat how the input is cut into splits...
  val rawSplits = inputFormat.getSplits(jobContext).toArray
  val result = new Array[Partition](rawSplits.size)
  // ...and wrap each Hadoop split in a Spark partition, one-to-one
  for (i <- 0 until rawSplits.size) {
    result(i) = new NewHadoopPartition(id, i, rawSplits(i).asInstanceOf[InputSplit with Writable])
  }
  result
}
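
The splits are also where the locality lives: the companion getPreferredLocations hook asks each split which hosts hold its data, and the scheduler uses those hostnames as placement hints. Below is a reduced sketch of that hook (the real NewHadoopRDD version additionally handles HDFS in-memory cached replicas):

override def getPreferredLocations(hsplit: Partition): Seq[String] = {
  // Unwrap the Hadoop InputSplit stored inside the Spark partition...
  val split = hsplit.asInstanceOf[NewHadoopPartition].serializableHadoopSplit.value
  // ...and return the hostnames holding its data, as reported by the
  // split itself (for an HDFS file, ultimately by the NameNode)
  split.getLocations.filter(_ != "localhost")
}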

2) No. It does not depend on the resource manager; it depends on the Hadoop InputFormat used to read the file.
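
To make that contract concrete, here is a toy sketch (a hypothetical MySplit class, not anything from Spark or Hadoop) of what every InputFormat's splits must provide: a length, and, crucially for locality, getLocations, the hosts where the data physically sits. HDFS's FileSplit answers getLocations from the NameNode's block metadata; Spark only consumes the hints:

import java.io.{DataInput, DataOutput}
import org.apache.hadoop.io.Writable
import org.apache.hadoop.mapreduce.InputSplit

// Hypothetical split type, for illustration only
class MySplit(var length: Long, var hosts: Array[String])
    extends InputSplit with Writable {

  def this() = this(0L, Array.empty[String]) // needed for Writable deserialization

  override def getLength: Long = length
  override def getLocations: Array[String] = hosts // the locality hint Spark reads

  // Hadoop ships splits to tasks in serialized form
  override def write(out: DataOutput): Unit = {
    out.writeLong(length)
    out.writeInt(hosts.length)
    hosts.foreach(out.writeUTF)
  }

  override def readFields(in: DataInput): Unit = {
    length = in.readLong()
    hosts = Array.fill(in.readInt())(in.readUTF())
  }
}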

3) The same mechanism applies: HBase and Hive expose their data through Hadoop InputFormats as well, as the sketch below shows.
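
For example, reading HBase goes through its TableInputFormat, which produces one split per region with the hosting region server as the location. A minimal sketch, assuming an existing SparkContext sc and a table named "my_table" (both placeholders):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "my_table") // placeholder table name

// Each resulting partition corresponds to an HBase region, and its
// preferred location is the region server, so locality comes for free
val rdd = sc.newAPIHadoopRDD(
  conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])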

4) The mechanism is similar; for example, the KafkaRDD implementation maps Kafka partitions to Spark partitions one-to-one, as sketched below.
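
A condensed, hypothetical sketch of that idea (illustrative names, not the real spark-streaming-kafka classes): each Spark partition wraps exactly one Kafka topic-partition and advertises the leader broker's host as its preferred location, which pays off when executors run on the same machines as the brokers:

import org.apache.spark.Partition

// Illustrative partition type; the real KafkaRDD keeps similar fields
class KafkaLikePartition(
    override val index: Int, // Spark partition index, one per Kafka partition
    val topic: String,
    val kafkaPartition: Int,
    val leaderHost: String   // broker currently leading this topic-partition
) extends Partition

// Inside the RDD, locality falls out of the one-to-one mapping:
//   override def getPreferredLocations(p: Partition): Seq[String] =
//     Seq(p.asInstanceOf[KafkaLikePartition].leaderHost)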

5) I believe they use the same mechanism.

– Vitalii Kotliarenko
  • So I understand that there are *connectors* implemented for each input app/engine (HDFS, Kafka, Cassandra, ...), and these connectors actually map their partitioning system onto RDD partitions. – Juh_ May 27 '16 at 13:42