
My Scalding job is translated into 9 MapReduce jobs (m/r jobs). It's not easy for me to understand which part of the code each m/r job represents. Is there anything that could help me understand my job better?

// This has been copy-pasted from our internal wiki at Tapad. Feel free to share your experience!

Oleksii

2 Answers


Scalding can generate a job graph in .dot format; it's triggered by the --tool.graph flag. Here are the steps:

sbt
project mapreduce

run-main com.twitter.scalding.Tool com.company.YourJobClass \
  --tool.graph \
  --hdfs \
  --arg1 value_1 \
  --arg2 value_2
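
For context, com.company.YourJobClass is just an ordinary Scalding job class. A minimal placeholder sketch (the class name and logic below are made up for illustration, not taken from the question) could look like this:

import com.twitter.scalding._

// Stand-in for com.company.YourJobClass: any regular Scalding Job works.
// The --arg1/--arg2 values from the command line arrive through Args.
class YourJobClass(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("arg1")))      // e.g. value_1 is an input path
    .map(_.toLowerCase)
    .write(TypedTsv[String](args("arg2")))    // e.g. value_2 is an output path
}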

You should get two generated files ending in .dot. They are plain text files: one is a very detailed graph of all the Cascading functions used by your job, and the other, ending in _steps.dot, is a graph of the m/r jobs. Open them in your favorite editor and trace the nodes and their connections.

It's possible to generate PDF or PNG files from the .dot files using Graphviz. Here are the steps:

# if you don't have Graphviz installed, on a Mac you can get it via Homebrew
brew install graphviz

# generate a PDF file
dot -Tpdf myjob_steps.dot -o myjob_steps.pdf

# generate a PNG file (could be huge!)
dot -Tpng myjob_steps.dot -o myjob_steps.png

Bonus tip: it can still be hard to figure out where each m/r job lives in your code. Adding descriptions to your pipes (TypedPipe has a withDescription method for this) will include them in the myjob_steps.dot file. Experiment with it and regenerate the .dot file (see the sketch below). At this point generating a PDF isn't even necessary: just open myjob_steps.dot in your favorite editor and search for the descriptions you used to mark up the code. You can find examples in the scalding repo.
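
As a rough illustration, here is a minimal sketch of a job annotated with withDescription. The class name, argument names, and description strings are made up for the example, not taken from the question:

import com.twitter.scalding._

// Toy word-count job. The withDescription strings should show up next to
// the corresponding steps in the generated myjob_steps.dot file.
class AnnotatedWordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .withDescription("read raw lines")
    .flatMap(_.split("""\s+""").filter(_.nonEmpty))
    .withDescription("tokenize lines into words")
    .map(word => (word, 1L))
    .sumByKey
    .toTypedPipe
    .withDescription("sum counts per word")
    .write(TypedTsv[(String, Long)](args("output")))
}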

Oleksii

I've been using Sahale for this. It was pretty simple to set up, with the caveat that it only seems to work on Scala 2.11.x and Scalding 0.16.x (as of this writing). It visualizes the MapReduce job flow with Scalding line numbers that relate each step back to your code. Since it's a database-backed web application, it stores previous runs, so you can track job performance as you develop. Some metrics are missing when I run tracked jobs from IntelliJ, but they're all there when I run things on a real cluster.

This article does a good job of giving a tour of what Sahale does.

jpk
    Thanks for the tip! Seems like Driven from Cascading has similar functionality. I was looking for a quick way to get some sort of *explain* for my job without running it on a cluster. This is where 'tool.graph' is useful. – Oleksii Jun 13 '17 at 22:57