
The Bokeh server allows the user to execute practically any Python code in a callback.

I would like to know if it can also be used to run Spark jobs.

So far, I have found some ideas here (Best Practice to launch Spark Applications via Web Application?), but I am not sure.

To make it a little bit more specific:

  1. The Bokeh server hosts a web application with 2 buttons.
  2. If button 1 is clicked, Spark job 1 (e.g. word frequency on data set 1) should be executed and some resulting data shown on the page.
  3. If button 2 is clicked, Spark job 2 (e.g. word frequency on data set 2) should be executed and some resulting data shown on the page.
Karel Macek

1 Answer


I know this thread is super old, but I had the exact same question recently.

I got Spark running in my Bokeh app. What I did is not a production-grade deployment, but it does work and lets people self-serve. A couple of things to note that made it work for me:

  1. I needed to instantiate Spark with getOrCreate so that different users, each with their own Bokeh session, could properly access Spark.
  2. I made the callback non-blocking so that the user could continue interacting while their Spark job was running.
    1. I also made a very crude display of the Spark job's status (it leaves a lot to be desired).

Here is a simplified look at my bokeh server main.py (which is open source and you can see here - https://github.com/mozilla/overscripted-explorer/blob/22feeedaf655bd7058331a5217900b0d2f41448b/text_search/main.py)

Instantiating Spark. The getOrCreate is the important thing here:

from pyspark import SparkContext, SQLContext

sc = SparkContext.getOrCreate()  # reuses the running context instead of failing when a second session starts
spark = SQLContext(sc)

....

def do_spark_computation():
    ....
    df = spark.read.parquet(DATA_FILE)
    frac = sample_frac.value / 100  # sample_frac is a bokeh widget
    sample = df.sample(False, frac)
    ....

....
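The getOrCreate idiom matters because each browser session re-runs the script top to bottom, and a plain constructor would try to create a second SparkContext and fail. Here is a pure-Python sketch of the same idiom, with no Spark required; the `Context` class is a hypothetical stand-in for SparkContext:

```python
class Context:
    """Hypothetical stand-in for SparkContext to illustrate getOrCreate."""
    _instance = None

    def __init__(self):
        # A second direct construction fails, like a second SparkContext() would
        if Context._instance is not None:
            raise RuntimeError("a Context already exists")
        Context._instance = self

    @classmethod
    def getOrCreate(cls):
        # Reuse the existing instance instead of failing
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance


a = Context.getOrCreate()  # the first session creates the context
b = Context.getOrCreate()  # later sessions reuse the same one
assert a is b
```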

For the non-blocking, I cribbed from this example from the bokeh docs: https://docs.bokeh.org/en/latest/docs/user_guide/server.html#updating-from-unlocked-callbacks

from concurrent.futures import ThreadPoolExecutor
from functools import partial

from bokeh.document import without_document_lock
from bokeh.io import curdoc
from tornado.gen import coroutine


EXECUTOR = ThreadPoolExecutor(max_workers=2)
doc = curdoc()  # It was important to set this up globally

def do_spark_computation():
    ....
    df = spark.read.parquet(DATA_FILE)
    frac = sample_frac.value / 100  # sample_frac is a bokeh widget
    sample = df.sample(False, frac)
    ....

@coroutine
@without_document_lock
def get_new_data():
    # show a "job started" status immediately, on the event loop
    doc.add_next_tick_callback(function_updates_bokeh_models)
    # run the Spark job on the executor without blocking the document
    results = yield EXECUTOR.submit(do_spark_computation)
    # push the results back onto the event loop once the job finishes
    doc.add_next_tick_callback(partial(function_updates_bokeh_models, results))


apply_button.on_click(get_new_data)
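The same non-blocking pattern can be seen in isolation by swapping the Spark job for a placeholder, so it runs without a cluster or a Bokeh server. The names `fake_spark_job` and `update_models` below are hypothetical stand-ins for `do_spark_computation` and `function_updates_bokeh_models`:

```python
from concurrent.futures import ThreadPoolExecutor

EXECUTOR = ThreadPoolExecutor(max_workers=2)

def fake_spark_job(frac):
    # stand-in for do_spark_computation(): pretend to sample 1000 rows
    total = 1000
    return int(total * frac)

results = []

def update_models(result=None):
    # stand-in for function_updates_bokeh_models: record what was pushed
    results.append(result)

# submit the job without blocking the caller
future = EXECUTOR.submit(fake_spark_job, 0.25)
update_models()                  # immediate status update; result not ready yet
update_models(future.result())  # runs once the job finishes
```

In the real app, the two `update_models` calls happen inside `doc.add_next_tick_callback` so Bokeh models are only touched while the document lock is held.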
birdsarah