
I want to define an environment variable in a Databricks init script and then read it in a PySpark notebook. I wrote this init script:

    dbutils.fs.put("/databricks/scripts/initscript.sh","""
    #!/bin/bash
    export env="dev"
    pip install pretty-html-table==0.9.14
    """, True)

The PySpark code:

    import os
    environment = os.getenv("env")

It gives this error when the value is later concatenated to a string:

    TypeError: can only concatenate str (not "NoneType") to str

i.e. os.getenv("env") returns None, so the PySpark notebook is not able to read the environment variable set by the init script.

Any idea how to fix this?

  • Why not define it on the cluster level instead? then it will be propagated everywhere – Alex Ott May 02 '22 at 17:49
  • @AlexOtt what if you have multiple clusters, but want a global environment variable set? Further, any time a new cluster is made, you don't want people to have to remember to add the environment variable via the Spark configs tab in Advanced Options every time they create a new cluster – Pablo Boswell Mar 30 '23 at 02:55
  • one way to achieve this is to use cluster policies... But have you tried to use the existing answer? – Alex Ott Apr 01 '23 at 12:03

1 Answer


You cannot use a plain export, because the variable will then only be available inside the init script's own subprocess.

Instead, use the following line in an init script to set an environment variable globally:

    echo AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_HERE >> /etc/environment

This appends the variable to the cluster's environment file, which is read by any subprocess on the cluster.
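
Applied to the question's setup, a minimal sketch of the same init script rewritten to use this approach (keeping the question's variable name env and value dev):

    # Write an init script that appends the variable to /etc/environment
    # instead of exporting it only inside the script's own shell.
    dbutils.fs.put("/databricks/scripts/initscript.sh", """
    #!/bin/bash
    echo env=dev >> /etc/environment
    pip install pretty-html-table==0.9.14
    """, True)

As with the original script, it still has to be configured as a cluster init script and the cluster restarted before the variable becomes visible to notebooks.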

In case you need admin permissions to edit the target file, you can use this instead:

    echo AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_HERE | sudo tee -a /etc/environment
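
On the notebook side, the variable can then be read as the question intended. A minimal sketch (assuming the cluster has been restarted with the init script above attached), using a default value so a missing variable no longer triggers the NoneType concatenation error:

    import os

    # Prints "dev" once the init script has appended env=dev to /etc/environment;
    # the default avoids the original TypeError if the variable is absent.
    environment = os.getenv("env", "env-not-set")
    print(environment)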