
I'm running Airflow in a clustered environment on two AWS EC2 instances: one for the master and one for the worker. The worker node periodically throws this error when running `airflow worker`:

[2018-08-09 16:15:43,553] {jobs.py:2574} WARNING - The recorded hostname ip-1.2.3.4 does not match this instance's hostname ip-1.2.3.4.eco.tanonprod.comanyname.io
Traceback (most recent call last):
  File "/usr/bin/airflow", line 27, in <module>
    args.func(args)
  File "/usr/local/lib/python3.6/site-packages/airflow/bin/cli.py", line 387, in run
    run_job.run()
  File "/usr/local/lib/python3.6/site-packages/airflow/jobs.py", line 198, in run
    self._execute()
  File "/usr/local/lib/python3.6/site-packages/airflow/jobs.py", line 2527, in _execute
    self.heartbeat()
  File "/usr/local/lib/python3.6/site-packages/airflow/jobs.py", line 182, in heartbeat
    self.heartbeat_callback(session=session)
  File "/usr/local/lib/python3.6/site-packages/airflow/utils/db.py", line 50, in wrapper
    result = func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/airflow/jobs.py", line 2575, in heartbeat_callback
    raise AirflowException("Hostname of job runner does not match")
airflow.exceptions.AirflowException: Hostname of job runner does not match
[2018-08-09 16:15:43,671] {celery_executor.py:54} ERROR - Command 'airflow run arl_source_emr_test_dag runEmrStep2WaiterTask 2018-08-07T00:00:00 --local -sd /var/lib/airflow/dags/arl_source_emr_test_dag.py' returned non-zero exit status 1.
[2018-08-09 16:15:43,681: ERROR/ForkPoolWorker-30] Task airflow.executors.celery_executor.execute_command[875a4da9-582e-4c10-92aa-5407f3b46d5f] raised unexpected: AirflowException('Celery command failed',)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/airflow/executors/celery_executor.py", line 52, in execute_command
    subprocess.check_call(command, shell=True)
  File "/usr/lib64/python3.6/subprocess.py", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'airflow run arl_source_emr_test_dag runEmrStep2WaiterTask 2018-08-07T00:00:00 --local -sd /var/lib/airflow/dags/arl_source_emr_test_dag.py' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/dist-packages/celery/app/trace.py", line 382, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/lib/python3.6/dist-packages/celery/app/trace.py", line 641, in __protected_call__
    return self.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/airflow/executors/celery_executor.py", line 55, in execute_command
    raise AirflowException('Celery command failed')
airflow.exceptions.AirflowException: Celery command failed

When this error occurs, the task is marked as failed in Airflow and thus fails my DAG, even though nothing actually went wrong in the task itself.

I'm using Redis as my queue and PostgreSQL as my meta-database; both are external AWS services. I'm running all of this in my company environment, which is why the full name of the server is ip-1.2.3.4.eco.tanonprod.comanyname.io. It looks like Airflow wants this full name somewhere, but I have no idea where I need to fix the value so that it gets ip-1.2.3.4.eco.tanonprod.comanyname.io instead of just ip-1.2.3.4.

The really weird thing about this issue is that it doesn't always happen; it seems to occur randomly every once in a while when I run a DAG. It also occurs sporadically across all of my DAGs, so it's not just one DAG. I find that strange, because it means other task runs are handling the hostname just fine.

Note: I've changed the real IP address to 1.2.3.4 for privacy reasons.
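For reference, the three hostname values in play here can be inspected on the worker with plain Python (the example outputs in the comments use the question's anonymized names, not real ones):

```python
import socket

# The fqdn is what Airflow 1.9 records and compares; the warning above
# means the recorded value and the current one disagreed between task
# start and heartbeat.
print(socket.gethostname())                        # e.g. ip-1-2-3-4
print(socket.getfqdn())                            # e.g. ip-1-2-3-4.eco.tanonprod.comanyname.io
print(socket.gethostbyname(socket.gethostname()))  # e.g. 1.2.3.4
```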

Answer:

https://github.com/apache/incubator-airflow/pull/2484

This is exactly the problem I am having and other Airflow users on AWS EC2-Instances are experiencing it as well.

Kyle Bridenstine
  • The issue suddenly occurred to me today, with Airflow v2.3.4. The github PR does handle the issue, but we still need to activate it by setting the configuration: `hostname_callable = airflow.utils.net.get_host_ip_address` This will translate both `ip-1-1-5-2` and `ip-1-1-5-2.ap-south-1.compute.internal` to `1.1.5.2`. – Santosh Satvik L Oct 07 '22 at 13:14

3 Answers


The hostname is set when the task instance runs, via `self.hostname = socket.getfqdn()`, where `socket` is Python's standard library `socket` module.

The comparison that triggers this error is:

fqdn = socket.getfqdn()
if fqdn != ti.hostname:
    logging.warning("The recorded hostname {ti.hostname} "
        "does not match this instance's hostname "
        "{fqdn}".format(**locals()))
    raise AirflowException("Hostname of job runner does not match")

It seems like the hostname on the EC2 instance is changing while the worker is running. Perhaps try manually setting the hostname as described here https://forums.aws.amazon.com/thread.jspa?threadID=246906 and see if that sticks.
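To make the failure mode concrete, here is a small sketch (a paraphrase of the check quoted above, not Airflow's actual code) showing how a hostname that drifts between task start and heartbeat trips the comparison:

```python
import socket

def heartbeat_check(recorded_hostname):
    """Paraphrase of the heartbeat_callback check in airflow/jobs.py:
    compare the hostname recorded when the task started against the
    fqdn the machine reports now, and fail on any difference."""
    fqdn = socket.getfqdn()
    if fqdn != recorded_hostname:
        raise RuntimeError("Hostname of job runner does not match: "
                           "recorded %r, current %r" % (recorded_hostname, fqdn))

# A task that recorded the current fqdn passes the check...
heartbeat_check(socket.getfqdn())

# ...but if the machine's name drifted after the task started (e.g. a
# DHCP renew or cloud-init appended a domain suffix), the stale recorded
# value no longer matches and the task is failed:
try:
    heartbeat_check(socket.getfqdn() + "-stale")  # simulated stale name
except RuntimeError as exc:
    print(exc)
```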

cwurtz
  • Ah, nice find. Glad you found the reason – cwurtz Aug 10 '18 at 01:17
  • cwurtz what do you think the best solution is in implementing this fix? I don't know if I should manually apply the fix to the Airflow file that needs it, or fork Airflow and cherry pick this commit that has the fix? – Kyle Bridenstine Aug 10 '18 at 14:31
  • Also it appears this is the ultimate reason I was having this other issue https://stackoverflow.com/questions/51365911/airflow-logs-brokenpipeexception which had a 50 point bounty that ended yesterday so you probably should have gotten that bounty! Sorry lol. – Kyle Bridenstine Aug 10 '18 at 14:38
  • Personally I'd fork Airflow, and cherry pick the change rather than do it outside of version control. That being said, further down in the thread there is another PR https://github.com/apache/incubator-airflow/pull/3036 which has a more robust solution and was merged already, and is a part of the 1.10 stable branch. Just depends if you want to upgrade to 1.10 yet or not. – cwurtz Aug 10 '18 at 14:51
  • I actually was in the process of setting up a new Airflow cluster to go from my current 1.9 to 1.10 to see if that would fix this error (I haven't tested on it yet though). One weird thing though is that I see 1.10 here https://github.com/apache/incubator-airflow/releases but when I run `pip install git+git://github.com/apache/incubator-airflow.git` I get version `v2.0.0.dev0+incubating` which I'm guessing is newer than 1.10? – Kyle Bridenstine Aug 10 '18 at 15:45
  • Try `pip install git+https://github.com/apache/incubator-airflow.git@v1-10-stable`, force it to use a stable branch – cwurtz Aug 11 '18 at 02:40
  • Got it thanks cwurtz. What's the `v2.0.0.dev0+incubating` version though? Is that like the next version that's going to be released but it's not officially released yet? – Kyle Bridenstine Aug 12 '18 at 23:18
  • Yeah, seems like it. I think it's expected that if you are installing via pip, you either specify the branch/tag or package name (ie. `pip install apache-airflow`) – cwurtz Aug 13 '18 at 00:15
  • we're having trouble installing Airflow version 1.10 using that command. It successfully installs Airflow and I can see it's version 1.10 by running `airflow version` but when I try to start up the scheduler we get an exception saying it's missing the MySqlDb module. I've created a new post for this here https://stackoverflow.com/questions/51861453/airflow-1-10-installation-failing?noredirect=1#comment90675468_51861453 – Kyle Bridenstine Aug 15 '18 at 15:33
  • The fix they put is wrong... github.com/apache/incubator-airflow/pull/2484/commits/… check out this post stackoverflow.com/a/36160103/3299397 the fix they put is giving us the hostname not the IP address. We have to use socket.gethostbyname(socket.gethostname()) – Kyle Bridenstine Aug 15 '18 at 17:37
  • >>> socket.gethostbyname(socket.gethostname()) '10.185.143.196' >>> socket.gethostname() 'ip-10-185-143-196' >>> socket.getfqdn() 'ip-10-185-143-196' – Kyle Bridenstine Aug 15 '18 at 17:46
  • In my 5th comment I linked to pull/3036. It looks like that provides the ability to use your own custom function, and is what I was referring to having been merged in 1.10. You should be able to configure a function in a module that calls `socket.gethostbyname(socket.gethostname())`. – cwurtz Aug 15 '18 at 20:17
  • In the `/usr/local/lib/python3.6/site-packages/airflow/utils/net.py` file of Airflow version 1.10 we see that this is the file that's being called now to get the IP. In that file they have these lines `# First we attempt to fetch the callable path from the config.` then `callable_path = conf.get('core', 'hostname_callable')` so it looks like we can just set it in our `airflow.cfg` file now. Sadly though we're stuck with this issue so we can't test it out yet https://stackoverflow.com/questions/51865634/airflow-1-10-scheduler-startup-fails – Kyle Bridenstine Aug 15 '18 at 20:20
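Putting the thread's conclusion together: on 1.10+ you can point `hostname_callable` at your own function instead of patching Airflow. A minimal sketch (the module path and file name are illustrative, not anything Airflow ships):

```python
# Hypothetical module, e.g. saved somewhere on PYTHONPATH as
# airflow_hostname.py.
import socket

def get_host_ip_address():
    """Return this machine's IP address, so that both the short name
    (ip-1-2-3-4) and the fqdn (ip-1-2-3-4.eco...) resolve to one stable
    value and the heartbeat comparison can no longer disagree."""
    return socket.gethostbyname(socket.gethostname())

# Referenced from airflow.cfg (1.10-era colon syntax):
#   [core]
#   hostname_callable = airflow_hostname:get_host_ip_address
```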

I had a similar problem on my Mac. I fixed it by setting `hostname_callable = socket:gethostname` in airflow.cfg.

Michael Kotliar

Personally, when running on my Mac, I found that I got similar errors when the Mac would sleep during a long job. The solution was to go into System Preferences -> Energy Saver and check "Prevent computer from sleeping automatically when the display is off."

Stephen