
I work with Airflow operators and MLproject. The pipeline looks like the following: I define an Airflow operator in which I specify the entry point of the MLproject and the parameters for the program; the run is then handed off to the MLproject file, where the command to execute is specified. The operator looks like this:

dq_checking = MLProjectOperator(
    task_id='dq_checking',
    dag=dag,
    project_name='my_project',
    project_version='$VERSION',
    entry_point='data_validation',
    base_image='my_docker_image',
    base_image_version='0.1.1',
    hadoop_env='SANDBOX',
    project_parameters={
        'tag': default_args['tag']
    },
    environment={
        'some_environment_variable': 'some_value'
    }
)

Here MLProjectOperator inherits from DockerOperator. Then I have the MLproject file:

name: my_project
conda_env: conda.yml
entry_points:
    data_validation:
        parameters:
            tag: string
        command: >
            spark-submit --master yarn --deploy-mode cluster
            --num-executors 20 --executor-cores 4
            my_project/scoring/dq_checking.py -t {tag}

The question is: I need to read some files in the script dq_checking.py. Locally they live in the same directory, my_project/scoring, but as far as I understand, in production this script runs in its own Docker container, so those files are not there. How can I add them to the container?

Nourless
  • check out `docker cp` to do it on the fly otherwise to do it via your dockerfile : https://stackoverflow.com/questions/30455036/how-to-copy-file-from-host-to-container-using-dockerfile – JonSG Jan 11 '23 at 14:24
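
A minimal sketch of the Dockerfile approach from that comment, assuming my_docker_image keeps the project under /opt/app (that path is an assumption, not something given in the question):

# Hypothetical lines in the Dockerfile behind my_docker_image: bake the data
# files into the image next to the script so relative reads keep working.
COPY my_project/scoring/ /opt/app/my_project/scoring/

For a container that Airflow starts and tears down per task, baking the files into the image like this is usually more practical than copying them into a running container with docker cp.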

1 Answer


It turned out that I should include the path to the files in the MANIFEST.in file when building the project. Then I add them alongside the script via the --files parameter of the spark-submit command.
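
A sketch of how that fits together, with a hypothetical file name (reference_data.csv) standing in for whatever dq_checking.py actually reads. MANIFEST.in makes sure the data files are packaged with the project sources:

include my_project/scoring/*.csv

and the MLproject entry point then ships them to the cluster by adding --files to the spark-submit call:

        command: >
            spark-submit --master yarn --deploy-mode cluster
            --num-executors 20 --executor-cores 4
            --files my_project/scoring/reference_data.csv
            my_project/scoring/dq_checking.py -t {tag}

With --files in yarn cluster mode the listed files end up in the working directory of the YARN containers, so the script can typically open them by bare file name (or resolve them with pyspark.SparkFiles.get) instead of relying on the local my_project/scoring path.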

Nourless