
I am new to Spark and have learnt some of its basic concepts. Although I now have some understanding of concepts such as partitions, stages, tasks, and transformations, I find it a bit difficult to connect these dots.

Assume the file has 4 lines (each line takes 64 MB, so it is the same as the default partition size) and I have one master node and 4 slave nodes.

val input = spark.textFile("hdfs://...log.txt")
// whatever transformation here
val splitedLines = input.map(line => line.split(" "))
                    .map(words => (words(0), 1))
                    .reduceByKey{(a,b) => a + b}
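
For reference, here is roughly what I run to actually trigger the job and inspect the lineage (as far as I understand, the transformations above are lazy, so nothing executes until an action is called):

println(splitedLines.toDebugString)   // prints the RDD lineage; indentation marks shuffle (stage) boundaries
val counts = splitedLines.collect()   // action: this is what actually ships tasks to the slave nodes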

I am wondering what will happen on the master node and on the slave nodes?

Here is my understanding; please correct me if I am wrong. When I start the SparkContext, each worker starts an executor, according to this post: What is a task in Spark? How does the Spark worker execute the jar file?

Then the application code will get pushed to the slave nodes.

Will each of the 4 slave nodes read one line from the file? If so, does that mean an RDD will be generated on each slave node? Then a DAG will be generated based on the RDD, stages will be built, and tasks will be identified as well. In this case, each slave node has one RDD and one partition to hold that RDD.

OR, will the master node read the entire file and build an RDD, then the DAG, then the stages, and only push the tasks to the slave nodes, so that the slave nodes only process tasks such as map, filter, or reduceByKey? But if this is the case, how will the slave nodes read the file? How is the file or RDD distributed among the slaves?

What I am looking for is to understand the flow step by step and to know where each step happens: on the master node or on the slave nodes?

Thank you for your time. Cheers.


1 Answer


Will each of the 4 slave nodes read one line from the file?

Yes. Since the file is split into blocks, it will be read in parallel. (The minimum number of partitions to read it into is a tunable property of textFile.)
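
For example (a minimal sketch; sc stands for the SparkContext, and the minPartitions value is just an illustration):

val input = sc.textFile("hdfs://...log.txt", minPartitions = 4)  // ask for at least 4 partitions
println(input.getNumPartitions)  // how many partitions Spark actually created from the HDFS blocks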

How is the file or RDD distributed among the slaves?

HDFS takes care of splitting the file into blocks, and the Spark workers (executors) are responsible for reading them.
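
A rough way to see this from the Spark side (illustrative only; the closure passed to mapPartitionsWithIndex runs as a task on the executor that owns each partition):

val linesPerPartition = input
  .mapPartitionsWithIndex { (idx, lines) =>
    // this runs on a worker, once per partition
    Iterator((idx, lines.size))
  }
  .collect()  // only the small (partitionId, count) pairs come back to the driver
linesPerPartition.foreach { case (idx, n) => println(s"partition $idx holds $n lines") }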


Source: https://github.com/jaceklaskowski/mastering-apache-spark-book
