When I started using big data technologies, I learned that the fundamental rule is "move the code, not the data". But I realise I don't know how that actually works: how does Spark know where to move the code?
I'm speaking here about the very first steps of a job, e.g. reading from a distributed file and applying a couple of map operations.
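To make that concrete, here is a minimal sketch of the kind of job I mean (the HDFS path and application name are made up for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MinimalJob {
  def main(args: Array[String]): Unit = {
    // Hypothetical app name; cluster settings come from spark-submit
    val conf = new SparkConf().setAppName("locality-question")
    val sc = new SparkContext(conf)

    // Read a distributed file and apply a couple of map ops:
    // where do these tasks get scheduled, and based on what information?
    val lines = sc.textFile("hdfs://namenode:8020/data/events.log")
    val lengths = lines.map(_.trim).map(_.length)
    println(lengths.count())

    sc.stop()
  }
}
```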
- In the case of an HDFS file, how does Spark know where the actual data blocks are? What tool/protocol is at work? (See the sketch after this list for what I mean by "where the data parts are".)
- Is it different depending on the resource manager (Spark standalone/YARN/Mesos)?
- What about storage applications built on top of HDFS, such as HBase or Hive?
- What about other distributed storage systems running on the same machines (such as Kafka)?
- Apart from Spark, is it the same for similar distributed engines, such as Storm or Flink?
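Regarding the first bullet: I can see that HDFS itself exposes block locations through the Hadoop FileSystem API, so I assume Spark consumes something like this when computing preferred locations for each partition; what I don't know is the exact mechanism. A sketch of what I mean (namenode address and path are hypothetical):

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object BlockLocations {
  def main(args: Array[String]): Unit = {
    // Hypothetical namenode address and file path
    val fs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration())
    val status = fs.getFileStatus(new Path("/data/events.log"))

    // Ask the namenode which datanodes hold each block of the file
    val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
    blocks.foreach { b =>
      println(s"offset=${b.getOffset} length=${b.getLength} hosts=${b.getHosts.mkString(",")}")
    }
    fs.close()
  }
}
```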
Edit:
For Cassandra + Spark, it seems that the (specialized) Spark-Cassandra connector manages this data locality: https://stackoverflow.com/a/31300118/1206998
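For illustration, this is the connector usage I have in mind (contact point, keyspace and table names are made up); as I understand the linked answer, the connector's RDD maps partitions to token ranges and reports the replica nodes as preferred locations:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object CassandraLocality {
  def main(args: Array[String]): Unit = {
    // Hypothetical Cassandra contact point
    val conf = new SparkConf()
      .setAppName("cassandra-locality-question")
      .set("spark.cassandra.connection.host", "cassandra-node-1")
    val sc = new SparkContext(conf)

    // The connector builds an RDD whose partitions correspond to Cassandra
    // token ranges and (per the linked answer) exposes replica hosts as
    // preferred locations for the scheduler
    val rdd = sc.cassandraTable("my_keyspace", "my_table")
    println(rdd.count())

    sc.stop()
  }
}
```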