
I'm trying to use a regular EBS persistent storage volume in OpenShift Online Next Gen, and getting the following error when attempting to deploy:

    Unable to mount volumes for pod "production-5-vpxpw_instanttabletop(d784f054-a66b-11e7-a41e-0ab8769191d3)": timeout expired waiting for volumes to attach/mount for pod "instanttabletop"/"production-5-vpxpw". list of unattached/unmounted volumes=[volume-mondv]

Followed (after a while) by multiple instances of:

    Failed to attach volume "pvc-702876a2-a663-11e7-8348-0a69cdf75e6f" on node "ip-172-31-61-152.us-west-2.compute.internal" with: Error attaching EBS volume "vol-0fb5515c87914b844" to instance "i-08d3313801027fbc3": VolumeInUse: vol-0fb5515c87914b844 is already attached to an instance status code: 400, request id: 54dd24cc-6ab0-434d-85c3-f0f063e73099

The log for the deploy pod looks like this after it all times out:

    --> Scaling production-5 to 1
    --> Waiting up to 10m0s for pods in rc production-5 to become ready
    W1001 05:53:28.496345       1 reflector.go:323] github.com/openshift/origin/pkg/deploy/strategy/support/lifecycle.go:509: watch of *api.Pod ended with: too old resource version: 1455045195 (1455062250)
    error: update acceptor rejected production-5: pods for rc "production-5" took longer than 600 seconds to become ready

I thought at first that this might be related to this issue, but the only running pods are the deploy pod and the one that's trying to start, and I've switched to a `Recreate` strategy as suggested there, with no results.

Things did deploy and run normally the very first time, but since then I haven't been able to get it to deploy successfully.

Can anyone shed a little light on what I'm doing wrong here?

Update #1:

As an extra wrinkle, sometimes when I deploy it takes what seems like a long time to spin up the deploy pod (I don't actually know how long it should take, but I get a warning suggesting things are going slowly, and my current deploy has been sitting for 15+ minutes so far without having stood up).

In the deploy pod's event list, I'm seeing multiple instances each of `Error syncing pod` and `Pod sandbox changed, it will be killed and re-created.` as I wait, having touched nothing (see the CLI sketch below).

Doesn't happen every time, and I haven't discerned a pattern.

Not sure if this is even related, but seemed worth mentioning.
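
For reference, the same event messages can also be pulled from the command line rather than the console. A rough sketch using the standard `oc` client (the pod name is a placeholder, not one of mine):

    # List recent events in the project, which includes the "Error syncing pod" messages
    oc get events --sort-by='.lastTimestamp'

    # Or show the event list attached to a specific pod (placeholder name)
    oc describe pod <deploy-pod-name>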

Update #2:

I tried deploying again this morning, and after canceling one deploy which was experiencing the issue described in my first update above, things stood up successfully.

I made no changes as far as I'm aware, so I'm baffled as to what the issue is or was here. I'll make a further update as to whether or not the issue recurs.

Update #3:

After a bunch of further experimentation, I seem to be able to get my pod up and running regularly now. I didn't change anything about the configuration, so I assume this is something to do with sequencing, but even now it's not without some irregularities:

If I start a deploy, the existing running pod hangs indefinitely in the terminating state according to the console, and will stay that way until it's hard deleted (without waiting for it to close gracefully). Until that happens, it'll continue to produce the error described above (as you'd expect).

Frankly, this doesn't make sense to me, compared to the issues I was having last night - I had no other pods running when I was getting these errors before - but at least it's progress in some form.

I'm having some other issues once my server is actually up and running (requests not making it to the server, and issues trying to upgrade to a websocket connection), but those are almost certainly separate, so I'll save them for another question unless someone tells me they're actually related.

Update #4:

OpenShift's ongoing issue listing hasn't changed, but things seem to be loading correctly now, so marking this as solved and moving on to other things.

For posterity, changing from Rolling to Recreate is key here, and even then you may need to manually kill the old pod if it gets stuck trying to shut down gracefully.
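
For the "manually kill the old pod" part, this is roughly what that looks like from the command line (a sketch using the standard `oc` client; the pod name is a placeholder for whichever old pod is stuck in Terminating):

    # Find the old pod stuck in Terminating
    oc get pods

    # Hard-delete it so the EBS volume can detach and the new pod can mount it
    oc delete pod <old-pod-name> --grace-period=0 --force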

Dashiel N

2 Answers


You cannot use a persistent volume in OpenShift Online with an application that has the deployment strategy set to 'Rolling'. Edit the deployment configuration and make sure the deployment strategy is set to 'Recreate'.

You state you used 'replace'. If you set it to that by editing the JSON/YAML of the deployment configuration, the value change would have been discarded as 'replace' isn't a valid option.
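
If you would rather do this from the command line than the web console, something like the following should work (a sketch assuming the standard `oc` client; the deployment configuration name `production` is inferred from the rc/pod names in your logs, so verify it with `oc get dc` first):

    # Check the current deployment strategy on the deployment configuration
    oc get dc production -o jsonpath='{.spec.strategy.type}'

    # Switch it from Rolling to Recreate (or do the same interactively with `oc edit dc production`)
    oc patch dc production -p '{"spec": {"strategy": {"type": "Recreate"}}}'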

Graham Dumpleton
  • Sorry about that. `Recreate` is what I had it set to. Edited the first post. – Dashiel N Oct 01 '17 at 15:45
  • Which OpenShift Online environment are you in? Starter, Pro? Which instance? Or your own OpenShift installation? – Graham Dumpleton Oct 01 '17 at 20:57
  • I'm on OpenShift Online Starter in US West 2. Note that some progress has been made, but honestly I'm more confused than ever about what's happening with this - issues that go away on their own make me worry they'll come back just as easily, heh. More info added to the first post. – Dashiel N Oct 01 '17 at 21:34
  • That environment has been having issues. You can see status at https://status.starter.openshift.com/ The web console also tells you top right if are any open issues with environments. – Graham Dumpleton Oct 01 '17 at 21:47
  • I saw the trouble message last night, but from the information listed there it sounded like it was only having issues creating new apps for people, so I didn't think it was relevant to the issues I've been having. Am I misunderstanding what the current issues are there? – Dashiel N Oct 01 '17 at 21:52
  • It is never that simple. For certain problems it can actually affect various aspects of deploying applications, so it covers any change to an application, whether it be a new application or a redeployment. So I would say the description should probably be broader and mention both new applications and re-deployments, as there's not too much difference really. – Graham Dumpleton Oct 01 '17 at 22:27
  • Gotcha. Sounds like I should wait for that to clear up then. Thanks for explaining! – Dashiel N Oct 01 '17 at 22:35
  • Quick follow-up question: should I wait for those issues to be cleared up before inquiring about other connectivity problems with the app (when the server is actually up and running), or are those likely to be unrelated? – Dashiel N Oct 02 '17 at 00:33
  • If you get an interactive shell in the container using ``oc rsh`` or using the web console, can you do ``curl $HOSTNAME:8080``? There have been some routing issues with ``us-east-1`` they are trying to sort out, but don't know of other clusters having same issue. The other issues in ``us-west-1`` could feasibly be affecting route setup as well if it relates to the datastore for the cluster configuration. – Graham Dumpleton Oct 02 '17 at 01:11
  • `curl $HOSTNAME:8080` in the terminal in the console spit out the source of the page, as expected, but the issue I've been having with page loads is more intermittent - 503s or just empty responses on different assets from one page load to the next. One time it loads fine, the next it doesn't. Added logging, and the requests for the missing assets never actually hit the server. (The websocket problem is another thing all together, unless it's just randomly not worked on a bunch of tries.) – Dashiel N Oct 02 '17 at 02:54
  • And you only have one replica? The issues with the platform may be affecting routing as well, or the separate routing issue that was affecting ``us-east-1`` may now be affecting ``us-west-1``. – Graham Dumpleton Oct 02 '17 at 03:10
  • Correct. Just the one replica. – Dashiel N Oct 02 '17 at 03:18
  • Going to mark this as the answer - while it seems like most of my difficulties were likely caused by the issues OpenShift has been experiencing, this would have been a key step, so might as well highlight it for future people with the same challenge. – Dashiel N Oct 07 '17 at 15:55
  • @DashielN I can confirm that OpenShift-3 is very unreliable (in comparison to the former OpenShift-2) in the sense of proper operation: even without any changes made in correct settings, pods can sporadically fail starting, mounting volumes, etc. Complete stops/restarts do not help. Only support can fix such issues but this may take quite a while. – Stan Dec 21 '17 at 11:11
  • @Stan If you are making those comments in relation to OpenShift Online, be aware that the environments have now been stabilised and you should not see the problems that were occurring previously. So your comment likely isn't applicable any more. If you haven't tried it for a while, you might want to try again. – Graham Dumpleton Dec 21 '17 at 11:19
  • @GrahamDumpleton I'm talking about problems I've experienced several times in the last few months, most recently a couple of days ago. – Stan Dec 22 '17 at 09:13
  • The only cluster which is noted as having any recent issues is us-west-1 which was 4 days ago. If you are having issues, and you are confident it isn't because you have failed to switch to Recreate deployment strategy, and aren't trying to scale an application with storage, you can report them at https://help.openshift.com/forms/community-contact.html – Graham Dumpleton Dec 22 '17 at 11:03
  • This solved my problem, but I'm confused: does only OpenShift Online have this limit? Or is k8s not designed for deployments that combine a pvc with rolling updates? – Tokenyet Dec 31 '19 at 14:57
  • Any multi node Kubernetes cluster will have the issue with rolling deployments if all the setup provides is the ReadWriteOnce (RWO) persistent volume type. You need to have ReadWriteMany (RWX) available to be able to do rolling deployments with a persistent volume attached. – Graham Dumpleton Dec 31 '19 at 21:04

The error clearly indicates that the volume is already attached to some other running instance.

    VolumeInUse: vol-0fb5515c87914b844 is already attached to an instance status code: 400, request id: 54dd24cc-6ab0-434d-85c3-f0f063e73099

You must clean up by either:

1) Detaching the volume from the running instance and reattaching it. Be careful about your data, because the EBS volume's lifecycle is tied to the pod's lifecycle.

2) Making sure, before creating another deployment for a new build, that the earlier running container instance is killed (by deleting the container instance), as sketched below.
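
With the `oc` client, that cleanup can look roughly like this (a sketch; the pod name is a placeholder and the deployment configuration name `production` is inferred from the question's logs):

    # See which pod(s) still reference a persistent volume claim
    oc get pods -o jsonpath='{range .items[*]}{.metadata.name}{" -> "}{.spec.volumes[*].persistentVolumeClaim.claimName}{"\n"}{end}'

    # Delete the pod that is still holding the volume (placeholder name)
    oc delete pod <old-pod-name>

    # Once the volume has detached, trigger the deployment again
    oc rollout latest dc/production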

Gyanendra Dwivedi
  • That was the first thing I tried - clearing my pod list entirely. There are no other running pods for it to conflict with, unless the deploy pod is somehow the issue (and it doesn't mention the volume in question at all in its information, so I don't see why it would be). Is there something else I should be killing that I'm not thinking of? – Dashiel N Oct 01 '17 at 06:45
  • Check the storage menu to see which instance your storage is linked to. It should give you a clue about what you need to detach it from. – Gyanendra Dwivedi Oct 01 '17 at 06:47
  • Sorry to be dense, but when you say "instance" here, what do you mean? Nothing in the console is actually called that, and I want to be sure I'm actually understanding you correctly. – Dashiel N Oct 01 '17 at 06:51
  • The only information I see in the storage information is the ID of the volume to which it is bound. Don't see anything there about attachment to a particular pod. – Dashiel N Oct 01 '17 at 06:52
  • `instances` meant OpenShift `deployments`. Sorry, the word is more related to AWS, on which OpenShift is built. In the storage section, note down the name field. Go to deployments and click on each of them. Check the configuration tab and read the sections; wherever applicable there will be a section mentioning the storage volume being used. The name of the volume should be the same as the name noted earlier. – Gyanendra Dwivedi Oct 01 '17 at 07:03
  • Thanks for clarifying - to clarify further, do you mean an item from the list of named deployments, or from the list of numbered deployment attempts (which is also just labeled deployments, for added confusion value) within a named deployment? I have only one of the former, and clearing the list of the later doesn't resolve the issue (when the most recent is deleted, it just auto-deploys again, with the issues described). – Dashiel N Oct 01 '17 at 07:10
  • You do not need to go into a numbered deployment. Just click on the deployment; there should be a history tab, and beside that a configuration tab. Please check that configuration and note down how many of the deployments have storage linked with the same name as noted earlier. You must be using the same volume in two different deployments, or reaching a quota limit. – Gyanendra Dwivedi Oct 01 '17 at 07:16
  • I have only one deployment (the same one I'm trying to deploy), and it's attached to that storage volume. There were a couple of previous deployments along the way, but I deleted them in my own attempts to troubleshoot this. Possible they somehow didn't release their attachment when they were deleted? – Dashiel N Oct 01 '17 at 07:22
  • Try deleting this deployment completely so that there is no deployment left, then click the `add to project` link at the top and select the 2nd option (deploy image). In the menu, keep the default radio button selection and pick your deployment. If it deploys successfully, it means none of your previous deployments has locked the storage. – Gyanendra Dwivedi Oct 01 '17 at 07:43
  • No dice with that approach either looks like. (And let me say, thanks for sticking with me on this!) – Dashiel N Oct 01 '17 at 15:38