I work with Airflow operators and MLflow's MLProject. The pipeline looks like this: I define an Airflow operator in which I specify the entry point of the MLProject and the parameters for the program; the operator then goes to the MLProject file, where the actual run command is specified. So the operator has the following form:
dq_checking = MLProjectOperator(
    task_id='dq_checking',
    dag=dag,
    project_name='my_project',
    project_version='$VERSION',
    entry_point='data_validation',
    base_image='my_docker_image',
    base_image_version='0.1.1',
    hadoop_env='SANDBOX',
    project_parameters={
        'tag': default_args['tag']
    },
    environment={
        'some_environment_variable': 'some_value'
    }
)
Here MLProjectOperator inherits from DockerOperator.
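It is an in-house class, so here is only a rough sketch of its shape (the attribute handling and the image tag format are simplified by me, not the real implementation):

from airflow.providers.docker.operators.docker import DockerOperator  # Airflow 2.x provider path

class MLProjectOperator(DockerOperator):
    # Simplified sketch: the real operator also builds the command that runs
    # the requested MLProject entry point inside the container.
    def __init__(self, project_name, project_version, entry_point,
                 base_image, base_image_version, hadoop_env,
                 project_parameters=None, **kwargs):
        # The image tag format is a guess; task_id, dag and environment
        # are passed through **kwargs to DockerOperator / BaseOperator.
        super().__init__(image=f'{base_image}:{base_image_version}', **kwargs)
        self.project_name = project_name
        self.project_version = project_version
        self.entry_point = entry_point
        self.hadoop_env = hadoop_env
        self.project_parameters = project_parameters or {}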
Then I have the following MLProject file:
name: my_project
conda_env: conda.yml
entry_points:
  data_validation:
    parameters:
      tag: string
    command: >
      spark-submit --master yarn --deploy-mode cluster
      --num-executors 20 --executor-cores 4
      my_project/scoring/dq_checking.py -t {tag}
The question is: I need to read some files in the script dq_checking.py. Locally they sit in the same directory, my_project/scoring. But as far as I understand, in production this script runs in its own Docker container, so those files are not there. How can I add them to the container?
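To make it concrete, this is roughly how the script reads those files today (the file name is made up; the point is the path relative to the script):

import json
from pathlib import Path

# Directory of the script itself; locally this is my_project/scoring
HERE = Path(__file__).resolve().parent

def load_rules():
    # 'dq_rules.json' is a placeholder name; locally it sits next to
    # dq_checking.py, but in the production container it is missing
    with open(HERE / 'dq_rules.json') as f:
        return json.load(f)

This works locally but breaks in production, so I need some way to get these files into the image or otherwise make them available to the script.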