  • I am trying to provision multiple Windows EC2 instances with Terraform's remote-exec provisioner using null_resource.

$ terraform -v
Terraform v0.12.6
provider.aws v2.23.0
provider.null v2.1.2

  • Originally, I was using three remote-exec provisioners (two of which rebooted the instance) directly on the aws_instance, without null_resource, and for a single instance everything worked fine.
  • I then needed to increase the count and, based on several links, ended up using null_resource. I have since reduced the issue to the point where I cannot run even one remote-exec provisioner on more than two Windows EC2 instances using null_resource.

Terraform template to reproduce the error message:

//VARIABLES

variable "aws_access_key" {
  default = "AK"
}
variable "aws_secret_key" {
  default = "SAK"
}
variable "instance_count" {
  default = "3"
}
variable "username" {
  default = "Administrator"
}
variable "admin_password" {
  default = "Password"
}
variable "instance_name" {
  default = "Testing"
}
variable "vpc_id" {
  default = "vpc-id"
}

//PROVIDERS
provider "aws" {
  access_key = "${var.aws_access_key}"
  secret_key = "${var.aws_secret_key}"
  region     = "ap-southeast-2"
}

//RESOURCES
resource "aws_instance" "ec2instance" {
  count         = "${var.instance_count}"
  ami           = "Windows AMI"
  instance_type = "t2.xlarge"
  key_name      = "ec2_key"
  subnet_id     = "subnet-id"
  vpc_security_group_ids = ["${aws_security_group.ec2instance-sg.id}"]
  tags = {
    Name = "${var.instance_name}-${count.index}"
  }
}

resource "null_resource" "nullresource" {
  count = "${var.instance_count}"
  connection {
    type     = "winrm"
    host     = "${element(aws_instance.ec2instance.*.private_ip, count.index)}"
    user     = "${var.username}"
    password = "${var.admin_password}"
    timeout  = "10m"
  }
  provisioner "remote-exec" {
    inline = [
      "powershell.exe Write-Host Instance_No=${count.index}"
    ]
  }
//   provisioner "local-exec" {
//     command = "powershell.exe Write-Host Instance_No=${count.index}"
//   }
//   provisioner "file" {
//       source      = "testscript"
//       destination = "D:/testscript"
//   }
}
resource "aws_security_group" "ec2instance-sg" {
  name        = "${var.instance_name}-sg"
  vpc_id      = "${var.vpc_id}"


//   RDP
  ingress {
    from_port   = 3389
    to_port     = 3389
    protocol    = "tcp"
    cidr_blocks = ["CIDR"]
  }

//   WinRM access from the machine running TF to the instance
  ingress {
    from_port   = 5985
    to_port     = 5985
    protocol    = "tcp"
    cidr_blocks = ["CIDR"]
  }

  tags = {
    Name        = "${var.instance_name}-sg"
  }

}
//OUTPUTS
output "private_ip" {
  value = "${aws_instance.ec2instance.*.private_ip}"
}

Observations:

  • Terraform runs the remote-exec provisioner on all three instances, reports one null_resource as created, and then hangs at "Still creating..." for the other two; the apply never reaches "Apply complete!".
  • Don't oversimplify like that please, it's not helpful to work out where you're going wrong. Reduce your example to a [mcve] that other people can run but still see the same error as you. – ydaetskcoR Aug 06 '19 at 07:11
  • Thanks for the feedback. I have modified the code so it can be reproduced now. – st_rt_dl_8 Aug 07 '19 at 02:15
  • @ydaetskcoR Added the template so that it's easy to reproduce than earlier snippet. – st_rt_dl_8 Aug 11 '19 at 12:58

3 Answers


Update: what eventually did the trick was downgrading Terraform to v0.11.14, as per this issue comment.
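If you do downgrade, Terraform's required_version constraint can guard against accidentally re-running the config with a newer binary. A minimal sketch (valid in both 0.11 and 0.12):

terraform {
  required_version = "0.11.14"
}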

A few things you can try:

  1. Inline remote-exec:
resource "aws_instance" "ec2instance" {
  count         = "${var.instance_count}"
  # ...
  provisioner "remote-exec" {
    connection {
      # ...
    }
    inline = [
      # ...
    ]
  }
}

Now you can refer to self inside the connection block to get the instance's private IP.
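For example, a sketch of the connection block using the question's settings (self refers to the aws_instance being provisioned):

connection {
  type     = "winrm"
  host     = "${self.private_ip}" # the instance's own private IP
  user     = "${var.username}"
  password = "${var.admin_password}"
  timeout  = "10m"
}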

  2. Add triggers to null_resource:
resource "null_resource" "nullresource" {
  triggers = {
    host    = "${element(aws_instance.ec2instance.*.private_ip, count.index)}" # Rerun when IP changes
    version = "${timestamp()}" # ...or rerun every time
  }
  # ...
}

You can use the triggers attribute to recreate null_resource and thus re-execute remote-exec.
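Alternatively, you can force a one-off re-run of a single null_resource without changing its triggers by tainting it (0.12 resource-address syntax shown):

$ terraform taint 'null_resource.nullresource[0]'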

Aleksi
  • Thanks for the suggestion Aleksi. I had already tried #1 and ended up using null_resource when I faced the same issue with remote-exec inside the aws_instance block. Tried #2, however, still the same issue: Terraform skipped running the provisioner on one of the instances and got stuck at "Still creating...". – st_rt_dl_8 Aug 13 '19 at 10:16
  • 1
    How about adding `sleep` at the beginning and end of your `inline` command? As per [this answer](https://stackoverflow.com/a/51777995/1763012). Some people also [report a similar issue](https://github.com/hashicorp/terraform/issues/22006#issuecomment-509588621) being fixed by downgrading to terraform `v11.14`, that might not be an option in your case though? – Aleksi Aug 13 '19 at 10:46
  • 1
    Tried that once again (Put sleep before and after the command). Did not work. What happened was, as mentioned in the original post, Terraform ran the provisioner on all the three instances, showed that the creating of one resource completed and then got stuck with "Still creating..." message for the other two instances and never showed the "Apply complete!" green message. Though [this](https://github.com/hashicorp/terraform/issues/22006#issuecomment-509588621) issue talks about file provisioner, I will still try downgrading and update soon. – st_rt_dl_8 Aug 13 '19 at 13:26
  • I downgraded to v0.11.14 and that magically worked. Seems like a bug in v0.12.6. Thanks a ton for your time on this! It really sucks to spend weeks on an issue only to find out something like this :) I will try to bring this to Hashicorp's attention. Meanwhile, could you please write the same in the answer so that I can accept it? – st_rt_dl_8 Aug 14 '19 at 04:48
  • I am having a VERY similar problem with the `chef` provisioner: [stack overflow link](https://stackoverflow.com/questions/57929171/chef-provisioner-in-terraform-hangs-when-provisioning-more-than-one-resource). I believe I started all this on a version later than 0.11.14, so it's always been there... I went through all this time to upgrade the syntax to be 0.12-compliant, but if it solves the multiple-provisioner issue, maybe it'd be worth it to revert it all. I logged my issue [on their github](https://github.com/hashicorp/terraform/issues/22722). – Max Cascone Sep 19 '19 at 16:58
  • 2
    Yes... that seems to have done it... 3 parallel `chef` provisions using `null_resource` went smoothly and successfully after downgrading and reverting syntax to `v.0.11.14`. I don't know how I didn't find this thread sooner. But at least it works now. – Max Cascone Sep 19 '19 at 20:33
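For reference, the sleep workaround suggested in the comments above would look roughly like this in the question's inline block (a sketch only; the delays are arbitrary, with Start-Sleep as the PowerShell equivalent of sleep):

provisioner "remote-exec" {
  inline = [
    "powershell.exe Start-Sleep -Seconds 10",
    "powershell.exe Write-Host Instance_No=${count.index}",
    "powershell.exe Start-Sleep -Seconds 10"
  ]
}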

I used this trigger in my null_resource and it works perfectly for me. It also works when the number of instances is increased, and it runs the configuration on all instances. I am using Terraform with OpenStack.

triggers = {
  instance_ids = join(",", openstack_compute_instance_v2.swarm-cluster-hosts[*].id)
}
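A minimal sketch of how this trigger might sit in a counted null_resource (resource names here are illustrative):

resource "null_resource" "configure" {
  count = length(openstack_compute_instance_v2.swarm-cluster-hosts)

  # Recreate this null_resource (and re-run its provisioners) whenever
  # the set of instance IDs changes
  triggers = {
    instance_ids = join(",", openstack_compute_instance_v2.swarm-cluster-hosts[*].id)
  }

  # connection and provisioner blocks go here, as in the question
}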

Niaz Hussain

Terraform 0.12.26 resolved a similar issue for me (multiple file provisioners when deploying multiple VMs).

Hope this helps you: https://github.com/hashicorp/terraform/issues/22006

DarrenS