
We're running Airflow in Google Cloud Composer, and we're running into difficulties with the GoogleSheetsToGCSOperator. We're using Composer 2, so I understand we have to make sure to use a connection with the correct scopes. That's fine; I've set up a connection with those scopes, and we no longer get permission errors. However, the DAG still doesn't work; it now fails in a couple of different ways.
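For reference, here's roughly what the DAG looks like (a minimal sketch; the connection id, spreadsheet id, and bucket name below are placeholders rather than our real values, and the scopes comment just reflects how I understand the connection should be configured):

from datetime import datetime

from airflow import DAG
from airflow.providers.google.suite.transfers.sheets_to_gcs import (
    GoogleSheetsToGCSOperator,
)

# Placeholder connection id. The connection is a Google Cloud connection whose
# "Scopes" extra includes the Drive scope alongside the usual cloud-platform one,
# e.g. https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/drive
GCP_CONN_ID = "google_cloud_drive"

with DAG(
    dag_id="test_brunel_core_2",
    start_date=datetime(2022, 10, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Read the Google Sheet and write its contents to a GCS bucket.
    upload_sheet_to_gcs = GoogleSheetsToGCSOperator(
        task_id="upload_sheet_to_gcs_airflow_permission_test_sheet",
        gcp_conn_id=GCP_CONN_ID,
        spreadsheet_id="<spreadsheet-id>",          # placeholder
        destination_bucket="<destination-bucket>",  # placeholder
    )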

Most of the time, any DAG that tries to upload a Google Sheet to GCS fails with the error Negsignal.SIGKILL. For example:

--------------------------------------------------------------------------------
[2022-10-03, 15:50:55 UTC] {taskinstance.py:1251} INFO - Starting attempt 1 of 1
[2022-10-03, 15:50:55 UTC] {taskinstance.py:1252} INFO - 
--------------------------------------------------------------------------------
[2022-10-03, 15:50:55 UTC] {taskinstance.py:1271} INFO - Executing <Task(GoogleSheetsToGCSOperator): upload_sheet_to_gcs_airflow_permission_test_sheet> on 2022-10-03 15:50:38.412899+00:00
[2022-10-03, 15:50:55 UTC] {standard_task_runner.py:52} INFO - Started process 529848 to run task
[2022-10-03, 15:50:55 UTC] {standard_task_runner.py:79} INFO - Running: ['airflow', 'tasks', 'run', 'test_brunel_core_2', 'upload_sheet_to_gcs_airflow_permission_test_sheet', 'manual__2022-10-03T15:50:38.412899+00:00', '--job-id', '7342', '--raw', '--subdir', 'DAGS_FOLDER/DAGs/z_airflow_testing_dags/test_brunel_2_functions.py', '--cfg-path', '/tmp/tmpyuhkixqc', '--error-file', '/tmp/tmp7p2delaz']
[2022-10-03, 15:50:55 UTC] {standard_task_runner.py:80} INFO - Job 7342: Subtask upload_sheet_to_gcs_airflow_permission_test_sheet
/opt/python3.8/lib/python3.8/site-packages/airflow/utils/log/file_task_handler.py:110: ResourceWarning: unclosed file <_io.TextIOWrapper name='/home/airflow/gcs/logs/test_brunel_core_2/upload_sheet_to_gcs_airflow_permission_test_sheet/2022-10-03T15:50:38.412899+00:00/1.log' mode='a' encoding='utf-8'>
  self.handler = NonCachingFileHandler(local_loc, encoding='utf-8')

[2022-10-03, 15:50:56 UTC] {task_command.py:298} INFO - Running <TaskInstance: test_brunel_core_2.upload_sheet_to_gcs_airflow_permission_test_sheet manual__2022-10-03T15:50:38.412899+00:00 [running]> on host airflow-worker-j28mn
[2022-10-03, 15:50:56 UTC] {taskinstance.py:1448} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=process_dev_joe_m
AIRFLOW_CTX_DAG_ID=test_brunel_core_2
AIRFLOW_CTX_TASK_ID=upload_sheet_to_gcs_airflow_permission_test_sheet
AIRFLOW_CTX_EXECUTION_DATE=2022-10-03T15:50:38.412899+00:00
AIRFLOW_CTX_DAG_RUN_ID=manual__2022-10-03T15:50:38.412899+00:00
[2022-10-03, 15:51:02 UTC] {local_task_job.py:154} INFO - Task exited with return code Negsignal.SIGKILL
[2022-10-03, 15:51:02 UTC] {taskinstance.py:1279} INFO - Marking task as FAILED. dag_id=test_brunel_core_2, task_id=upload_sheet_to_gcs_airflow_permission_test_sheet, execution_date=20221003T155038, start_date=20221003T155055, end_date=20221003T155102

The rest of the time, some random task in the DAG fails (not necessarily the step with the GoogleSheetsToGCSOperator). Sometimes a step fails with absolutely no log being generated at all; other times a log is generated but contains no errors. The only clue is a warning:

/opt/python3.8/lib/python3.8/site-packages/airflow/utils/log/file_task_handler.py:110: ResourceWarning: unclosed file <_io.TextIOWrapper name='/home/airflow/gcs/logs/test_flakiness/create_table_JM_test_table.create/2022-10-04T09:11:58.425115+00:00/1.log' mode='a' encoding='utf-8'>
  self.handler = NonCachingFileHandler(local_loc, encoding='utf-8')

The weird thing about that warning is that it's warning about the log file itself. As in, that message is written into the log file gs://europe-west1-process-dev-ai-fd1dc540-bucket/logs/test_flakiness/create_table_JM_test_table.create/2022-10-04T09:11:58.425115+00:00/1.log. Of course the file is open: you're writing to it, so why warn about it being open?

Some other facts that may or may not be relevant:

  • The environment image is composer-2.0.25-airflow-2.2.5.
  • When monitoring the environment, all resources (CPU, memory, etc.) seem fine; nothing is hitting its limits.
  • Our environment is configured to use between 1 and 4 workers. Only one worker is ever used, so I don't think it can be a problem with multiple workers all trying to write to the same file at once.
  • This is all happening in our test environment. The same DAG works absolutely fine in our prod environment. Our prod environment is running composer-1.19.3-airflow-2.2.5, and is therefore set up differently when it comes to things like Google Drive authentication scopes. So that's already two potential reasons why things are different in the prod environment.
  • You might check this SO post; it might be helpful: https://stackoverflow.com/questions/69231797/airflow-dag-fails-when-pythonoperator-with-error-negsignal-sigkill. If not, can you provide a sample DAG, input file, etc. for me to reproduce your scenario? – Anjela B Oct 05 '22 at 02:28
