I'm working on an Apache Spark application that I submit to an AWS EMR cluster from an Airflow task.

In the Spark application logic I need to read files from AWS S3 and data from AWS RDS. For example, to connect to a PostgreSQL database on AWS RDS from the Spark application, I need to provide the database username and password.

I'm looking for the best, most secure way to store these credentials and pass them as parameters to my Spark application. Where should I store them to keep the system secure: as environment variables, somewhere in Airflow, or elsewhere?

alexanoid
    Look at the AWS blogs for secure ways to do it. Here is one: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-credentialsprovider.html – sramalingam24 Feb 06 '19 at 14:26

2 Answers

In Airflow you can create Variables to store this information. Variables can be listed, created, updated and deleted from the UI (Admin -> Variables). You can then access them from your code as follows:

from airflow.models import Variable

# Fetch the value stored under the Variable key "foo"
foo = Variable.get("foo")
Juta

Airflow covers credentials management well through its Connection SQLAlchemy model, which can be accessed from the web UI (where passwords remain hidden).

  • You can control the Fernet key (fernet_key in airflow.cfg) that Airflow uses to encrypt passwords when storing Connection details in its backend meta-db.

  • It also provides an extra param for storing unstructured / client-specific settings, such as the {"use_beeline": true} config for HiveServer2.

  • In addition to the web UI, you can also edit Connections via the CLI (which is true for pretty much every feature of Airflow).

  • Finally, if your use case involves dynamically creating / deleting a Connection, that is also possible by using the underlying SQLAlchemy Session. You can see the implementation details in cli.py.

Note that Airflow treats all Connections equally regardless of their type (the type is just a hint for the end user). Airflow distinguishes them by conn_id only.
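
As a rough illustration of how a stored Connection could feed the Spark side, here is a minimal sketch that turns a Connection-like object into options for Spark's JDBC reader. The helper names (build_jdbc_options, jdbc_options_from_airflow), the host, the table name, and the credentials are all hypothetical, and the import path shown is the Airflow 1.10 one (newer releases expose BaseHook from airflow.hooks.base). How the options actually reach the job on EMR is not covered here.

```python
from types import SimpleNamespace


def build_jdbc_options(conn, table):
    """Map a Connection-like object to options for Spark's JDBC data source.

    `conn` only needs the attributes host, port, schema, login and password,
    which are the field names used by Airflow's Connection model.
    """
    port = conn.port or 5432  # fall back to the PostgreSQL default port
    return {
        "url": f"jdbc:postgresql://{conn.host}:{port}/{conn.schema}",
        "dbtable": table,
        "user": conn.login,
        "password": conn.password,
        "driver": "org.postgresql.Driver",
    }


def jdbc_options_from_airflow(conn_id, table):
    """Hypothetical wrapper: look up conn_id in Airflow, then build the options."""
    # Imported lazily so build_jdbc_options stays usable without Airflow installed.
    # Airflow 1.10 import path (assumption about the version in use).
    from airflow.hooks.base_hook import BaseHook

    return build_jdbc_options(BaseHook.get_connection(conn_id), table)


# Stand-in object with the same attribute names as an Airflow Connection,
# filled with made-up values purely for demonstration.
fake_conn = SimpleNamespace(
    host="db.example.com",
    port=None,
    schema="sales",
    login="spark_user",
    password="not-a-real-password",
)
options = build_jdbc_options(fake_conn, "public.orders")
```

Inside the Spark application, a dictionary like this could be used as spark.read.format("jdbc").options(**options).load(). Keeping the lookup in one small function means the password never has to be hard-coded in the DAG file or the job source.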

y2k-shubham