
I'm trying to create a basic Dataproc cluster (I used default values) in a GCP project. The VMs are created, but the cluster stays in the Provisioning state forever, until it times out.

  • I tried both the console and the command line (see the sketch after this list).
  • I tried different image versions (2.0-debian, 2.0-ubuntu, 1.5-debian, 1.5-ubuntu).
  • No optional components are selected (the cluster will be used for Spark jobs).
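
For reference, the CLI attempt was roughly the following sketch; the cluster name, region, and exact image version here are illustrative placeholders, not my real values:

```bash
# Minimal Dataproc cluster with default values; name, region, and image
# version are placeholders.
gcloud dataproc clusters create test-cluster \
    --region=europe-west1 \
    --image-version=2.0-debian10
```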

In all those cases I get the following error (found by SSHing into the master and looking at /var/log/google-dataproc-agent.0.log):

Network is unreachable: dataproccontrol-europe-west1.googleapis.com/2a00:1450:400c:c04:0:0:0:5f:443

The full error trace:

Jul 24, 2021 11:02:53 AM com.google.cloud.hadoop.services.repackaged.com.google.cloud.hadoop.util.ResilientOperation nextSleep INFO: Transient exception caught. Sleeping for 1120, then retrying.
com.google.cloud.hadoop.services.repackaged.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 29.974635818s. [buffered_nanos=30006131805, waiting_for_connection]
        at com.google.cloud.hadoop.services.repackaged.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:244)
        at com.google.cloud.hadoop.services.repackaged.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:225)
        at com.google.cloud.hadoop.services.repackaged.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:142)
        at com.google.cloud.dataproc.control.v1.AgentServiceGrpc$AgentServiceBlockingStub.createAgent(AgentServiceGrpc.java:735)
        at com.google.cloud.hadoop.services.agent.protocol.AgentApiAsyncUpdater$1.call(AgentApiAsyncUpdater.java:238)
        at com.google.cloud.hadoop.services.agent.protocol.AgentApiAsyncUpdater$1.call(AgentApiAsyncUpdater.java:235)
        at com.google.cloud.hadoop.services.repackaged.com.google.cloud.hadoop.util.ResilientOperation.retry(ResilientOperation.java:67)
        at com.google.cloud.hadoop.services.agent.protocol.AgentApiAsyncUpdater.executeWithBackoff(AgentApiAsyncUpdater.java:345)
        at com.google.cloud.hadoop.services.agent.protocol.AgentApiAsyncUpdater.createAgent(AgentApiAsyncUpdater.java:234)
        at com.google.cloud.hadoop.services.agent.protocol.AgentApiAsyncUpdater.getOrCreateAgent(AgentApiAsyncUpdater.java:203)
        at com.google.cloud.hadoop.services.agent.protocol.AgentApiAsyncUpdater.run(AgentApiAsyncUpdater.java:183)
        at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.MoreExecutors$ScheduledListeningDecorator$NeverSuccessfulListenableFutureTask.run(MoreExecutors.java:679)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Jul 24, 2021 11:03:23 AM com.google.cloud.hadoop.services.repackaged.com.google.cloud.hadoop.util.ResilientOperation nextSleep INFO: Transient exception caught. Sleeping for 1958, then retrying.
com.google.cloud.hadoop.services.repackaged.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
        at com.google.cloud.hadoop.services.repackaged.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:244)
        at com.google.cloud.hadoop.services.repackaged.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:225)
        at com.google.cloud.hadoop.services.repackaged.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:142)
        at com.google.cloud.dataproc.control.v1.AgentServiceGrpc$AgentServiceBlockingStub.createAgent(AgentServiceGrpc.java:735)
        at com.google.cloud.hadoop.services.agent.protocol.AgentApiAsyncUpdater$1.call(AgentApiAsyncUpdater.java:238)
        at com.google.cloud.hadoop.services.agent.protocol.AgentApiAsyncUpdater$1.call(AgentApiAsyncUpdater.java:235)
        at com.google.cloud.hadoop.services.repackaged.com.google.cloud.hadoop.util.ResilientOperation.retry(ResilientOperation.java:67)
        at com.google.cloud.hadoop.services.agent.protocol.AgentApiAsyncUpdater.executeWithBackoff(AgentApiAsyncUpdater.java:345)
        at com.google.cloud.hadoop.services.agent.protocol.AgentApiAsyncUpdater.createAgent(AgentApiAsyncUpdater.java:234)
        at com.google.cloud.hadoop.services.agent.protocol.AgentApiAsyncUpdater.getOrCreateAgent(AgentApiAsyncUpdater.java:203)
        at com.google.cloud.hadoop.services.agent.protocol.AgentApiAsyncUpdater.run(AgentApiAsyncUpdater.java:183)
        at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.MoreExecutors$ScheduledListeningDecorator$NeverSuccessfulListenableFutureTask.run(MoreExecutors.java:679)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: com.google.cloud.hadoop.services.repackaged.io.netty.channel.AbstractChannel$AnnotatedSocketException: Network is unreachable: dataproccontrol-europe-west1.googleapis.com/2a00:1450:400c:c04:0:0:0:5f:443
Caused by: java.net.SocketException: Network is unreachable
        at sun.nio.ch.Net.connect0(Native Method)
        at sun.nio.ch.Net.connect(Net.java:482)
        at sun.nio.ch.Net.connect(Net.java:474)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:647)
        at com.google.cloud.hadoop.services.repackaged.io.netty.util.internal.SocketUtils$3.run(SocketUtils.java:91)
        at com.google.cloud.hadoop.services.repackaged.io.netty.util.internal.SocketUtils$3.run(SocketUtils.java:88)
        at java.security.AccessController.doPrivileged(Native Method)
        at com.google.cloud.hadoop.services.repackaged.io.netty.util.internal.SocketUtils.connect(SocketUtils.java:88)
        at com.google.cloud.hadoop.services.repackaged.io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:315)
        at com.google.cloud.hadoop.services.repackaged.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:248)
        at com.google.cloud.hadoop.services.repackaged.io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1342)
        at com.google.cloud.hadoop.services.repackaged.io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:548)
        at com.google.cloud.hadoop.services.repackaged.io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:533)
        at com.google.cloud.hadoop.services.repackaged.io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:54)
        at com.google.cloud.hadoop.services.repackaged.io.grpc.netty.WriteBufferingAndExceptionHandler.connect(WriteBufferingAndExceptionHandler.java:150)
        at com.google.cloud.hadoop.services.repackaged.io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:548)
        at com.google.cloud.hadoop.services.repackaged.io.netty.channel.AbstractChannelHandlerContext.access$1000(AbstractChannelHandlerContext.java:61)
        at com.google.cloud.hadoop.services.repackaged.io.netty.channel.AbstractChannelHandlerContext$9.run(AbstractChannelHandlerContext.java:538)
        at com.google.cloud.hadoop.services.repackaged.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
        at com.google.cloud.hadoop.services.repackaged.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
        at com.google.cloud.hadoop.services.repackaged.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
        at com.google.cloud.hadoop.services.repackaged.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at com.google.cloud.hadoop.services.repackaged.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at com.google.cloud.hadoop.services.repackaged.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)

Any help, please!

Thank you in advance.

Edit: my firewall rules & VPC settings: [screenshots of the firewall rules, VPC subnets, and routes]
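
In place of the screenshots, the same information can be listed from the CLI; the network name `dataproc-vpc` below is illustrative:

```bash
# Dump the VPC's firewall rules and routes (network name is a placeholder).
gcloud compute firewall-rules list --filter="network:dataproc-vpc"
gcloud compute routes list --filter="network:dataproc-vpc"
```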

Cluster configuration: [screenshot]

  • Check your firewall rules. – Dagang Jul 24 '21 at 16:13
  • https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/network – Dagang Jul 24 '21 at 16:35
  • Thanks @Dagang, my firewall rules are permissive (see my original post). Am I missing something? – med Jul 24 '21 at 16:40
  • I need to know more about your other networking settings. Is it a subnet-mode network? Did you enable Private Google Access? Is it an internal-IP-only cluster? – Dagang Jul 24 '21 at 16:55
  • I added my networking settings and cluster config. BTW, I tried both internal-ip-only = false and true, with the same problem. Thank you in advance – med Jul 24 '21 at 17:06
  • Could you also show the routes of the network? – Dagang Jul 24 '21 at 17:17
  • I added the routing as well – med Jul 24 '21 at 17:27
  • Seems you are missing a route to the internet. By default it should have a route with `--next-hop-gateway=default-internet-gateway`. – Dagang Jul 24 '21 at 17:40
  • Indeed, that fixed the problem. I added that route ("Default route to the Internet") and it works now. You can add your comment as an answer and I will accept it. Thank you @Dagang, you saved my day ;) – med Jul 24 '21 at 17:40
  • Glad it is fixed! – Dagang Jul 24 '21 at 17:42
  • I added an answer, please accept if it fixes the problem. Thanks! – Dagang Jul 24 '21 at 17:45

1 Answer


Based on the error message `Network is unreachable: dataproccontrol-europe-west1.googleapis.com/2a00:1450:400c:c04:0:0:0:5f:443` and your network settings, it seems you are missing a route to the internet.

You can fix the problem by adding a route to `0.0.0.0/0` for IPv4 (and `::/0` for IPv6) with `--next-hop-gateway=default-internet-gateway`; see more details in this doc. The route should have been created automatically for a new VPC network, but I guess it was deleted; see this doc.
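
A sketch of the fix for the IPv4 case, assuming illustrative route and network names:

```bash
# Recreate the default route to the internet (names are placeholders).
gcloud compute routes create default-route-to-internet \
    --network=dataproc-vpc \
    --destination-range=0.0.0.0/0 \
    --next-hop-gateway=default-internet-gateway
```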

The route is needed because the Dataproc agent on the VMs must access the Dataproc control API to get jobs and report status. The API domain name dataproccontrol-<region>.googleapis.com resolves to an external IP, so the VMs need a route to the internet (or to the Google API IP ranges); when Private Google Access is enabled, though, that traffic never leaves Google's data centers.

The recommendation is to always keep a route to the internet and use firewall rules for more granular access control. Also note that VMs without external IPs cannot reach the internet by default, even if routes and firewall rules allow it; see this doc for a solution. BTW, you can use the Connectivity Test tool for troubleshooting.
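
For completeness, sketches of the two settings mentioned above, with illustrative subnet/network names (Private Google Access on the subnet, and Cloud NAT as one common way to give internal-IP-only VMs outbound internet access):

```bash
# Enable Private Google Access so internal-IP-only VMs can reach Google APIs
# without their traffic leaving Google's network (names are placeholders).
gcloud compute networks subnets update dataproc-subnet \
    --region=europe-west1 \
    --enable-private-ip-google-access

# Cloud NAT: one way to give internal-IP-only VMs outbound internet access.
gcloud compute routers create nat-router \
    --network=dataproc-vpc \
    --region=europe-west1
gcloud compute routers nats create nat-config \
    --router=nat-router \
    --region=europe-west1 \
    --auto-allocate-nat-external-ips \
    --nat-all-subnet-ip-ranges
```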

  • Thank you @Dagang, this is what I was missing. I'm just wondering why internet access is needed for Dataproc; is it for downloading dependencies while building the cluster? – med Jul 24 '21 at 17:52
  • BTW, I also tried Connectivity Test. It is a great tool and I highly recommend it for diagnosing networking issues. In my case I was testing the connectivity between my cluster VMs; I never thought internet access would be needed! – med Jul 24 '21 at 17:55
  • The agent on the VM needs to access Google APIs to get jobs and report status. The API domain names resolve to external IPs, so the VMs need a route to the internet (or to the IP range, if you know it). But when Private Google Access is enabled, the actual traffic reaches the API without leaving Google's data centers. – Dagang Jul 24 '21 at 17:58
  • BTW, the recommendation is to always have the route to the internet, but use firewall rules to block unwanted ingress or egress access. If your VMs are internal-IP-only, by default they are not able to reach the internet even if there is a route. – Dagang Jul 24 '21 at 18:02
  • On ingress firewall rules, @med, it looks like you have an overly permissive 0.0.0.0/0 rule, "dataproc-vpc-allow-internal"; based on the name, you probably intended it to only allow VM-to-VM traffic, which is what Dataproc would need. You should fix that to at least 10.0.0.0/8 as soon as possible. – Dennis Huo Jul 24 '21 at 23:13
  • More precisely, for any rule that opens all TCP+UDP ports intending to allow VM-to-VM traffic on the shared network, you'll want the ingress IP ranges to match the IP ranges assigned to your subnet. For the subnet you screenshotted, that would actually mean two ranges (so given your config, 10.0.0.0/8 wouldn't be enough): 10.1.0.0/16, 172.1.0.0/16 – Dennis Huo Jul 24 '21 at 23:20
  • Hi @DennisHuo, yes, you are right; I had made my dataproc-vpc-allow-internal rule more permissive to diagnose my previous issue. But now I have assigned a tag 'dataproc-vm' to my Dataproc VMs, and on the firewall side traffic is allowed only for VMs with the tag 'dataproc-vm' – med Jul 26 '21 at 11:25
  • Thanks for checking, @med - FYI, while tags do limit which VMs a firewall rule applies to, tags do not change the "source IP ranges" behavior. For example, you might have a firewall rule opening TCP:22 to 0.0.0.0/0 with a tag also limiting to 'dataproc-vm', but that doesn't prevent you from being able to SSH from your home internet into that VM if the VM has an external IP address (e.g. the 0.0.0.0/0 is what allowed your home internet to connect to the SSH port, regardless of "tag"). I'd generally recommend double-checking expectations on any firewall rules with 0.0.0.0/0 all-ports – Dennis Huo Jul 27 '21 at 19:19
  • Thanks @DennisHuo for the information. The VMs have only private IPs, but I could SSH from Cloud Shell. BTW, is there a way to have a firewall rule allowing SSH traffic from Cloud Shell only? – med Jul 28 '21 at 14:19
  • Ah good to know, thanks! Unfortunately there doesn't appear to be any way to set firewall rules based on Cloud Shell IP addresses: https://stackoverflow.com/questions/57024031/gcp-open-firewall-only-to-cloud-shell but if the VMs are private-IP-only anyways, then at least you don't have to worry about SSH from outside GCP either. – Dennis Huo Jul 29 '21 at 17:42
  • Thank you @DennisHuo – med Aug 04 '21 at 09:43
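
As an illustration of the tightened internal rule discussed in the comments above (the network name and source ranges are the examples quoted there; match them to your own subnet):

```bash
# Allow VM-to-VM traffic only from the subnet's own IP ranges,
# instead of 0.0.0.0/0 (ranges here are the ones quoted in the comments).
gcloud compute firewall-rules create dataproc-vpc-allow-internal \
    --network=dataproc-vpc \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=tcp,udp,icmp \
    --source-ranges=10.1.0.0/16,172.1.0.0/16
```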