7

Question is simple:

master_dim.py calls dim_1.py and dim_2.py to execute in parallel. Is this possible in Databricks PySpark?

The image below shows what I am trying to do; it errors for some reason. Am I missing something here?

[Screenshot of the attempted notebook code: a ThreadPool mapping over notebook names and calling dbutils.notebook.run, which fails.]

John Rotenstein
Chandra

3 Answers

14

Just for others, in case they want to know how it worked:

from multiprocessing.pool import ThreadPool

# Run the child notebooks concurrently on up to 5 worker threads;
# each thread calls dbutils.notebook.run with the notebook path and its arguments.
pool = ThreadPool(5)
notebooks = ['dim_1', 'dim_2']
pool.map(lambda path: dbutils.notebook.run("/Test/Threading/" + path, timeout_seconds=60, arguments={"input-data": path}), notebooks)
Chandra
  • You can just use `path` - in this case it's easier to move the project into a new folder, etc. If the `path` isn't absolute, it's treated as relative to the current notebook. – Alex Ott Aug 27 '21 at 06:16
  • The limitation with this approach is that you can't share dependencies with the parallel jobs. I hope Databricks can improve this so we can pass more than just strings to the called notebook. – hui chen May 11 '23 at 08:21
  • I will create a level 2 list and run it after the level 1 list has completed. Gives control. – Chandra May 12 '23 at 10:36
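A minimal sketch of that staged approach, reusing the thread pool from the answer above; the level 2 notebook names ('fact_1', 'fact_2') are hypothetical, just to illustrate the idea:

from multiprocessing.pool import ThreadPool

pool = ThreadPool(5)

def run(path):
    # Same call as in the accepted answer
    return dbutils.notebook.run("/Test/Threading/" + path, timeout_seconds=60, arguments={"input-data": path})

# Level 1 notebooks run in parallel; pool.map blocks until all of them have finished
pool.map(run, ['dim_1', 'dim_2'])

# Only then do the (hypothetical) level 2 notebooks start, so dependencies between levels are respected
pool.map(run, ['fact_1', 'fact_2'])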
4

Your problem is that you're passing only Test/ as the first argument to dbutils.notebook.run (the name of the notebook to execute), but you don't have a notebook with that name.

You need to either modify the list of paths from ['Threading/dim_1', 'Threading/dim_2'] to ['dim_1', 'dim_2'] and replace dbutils.notebook.run('Test/', ...) with dbutils.notebook.run(path, ...),

or change dbutils.notebook.run('Test/', ...) to dbutils.notebook.run('/Test/' + path, ...). A sketch of the second option is below.
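A minimal sketch of the second option, assuming the child notebooks live under /Test/Threading as in the question:

from multiprocessing.pool import ThreadPool

pool = ThreadPool(5)
# Keep the paths relative to /Test/ and build the absolute path inside the lambda
notebooks = ['Threading/dim_1', 'Threading/dim_2']
pool.map(
    lambda path: dbutils.notebook.run('/Test/' + path,
                                      timeout_seconds=60,
                                      arguments={"input-data": path}),
    notebooks)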

Alex Ott
0

Databricks now has Workflows (multi-task jobs). Your master_dim task can have other tasks that execute in parallel after it finishes, passing task value parameters to dim_1, dim_2, etc.
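A hedged sketch of the task-values part, assuming a workflow where dim_1 and dim_2 are parallel tasks that depend on a task named master_dim; the task and key names here are made up for illustration:

# In the master_dim task: publish a value for downstream tasks
# (dbutils.jobs.taskValues is available when the notebook runs as a job task).
dbutils.jobs.taskValues.set(key="run_date", value="2023-05-12")

# In dim_1 / dim_2 (configured in the workflow to depend on master_dim and run in parallel):
run_date = dbutils.jobs.taskValues.get(taskKey="master_dim", key="run_date", debugValue="2023-05-12")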

inder