
I am trying to run queries against a Presto cluster I have running on Dataproc, via Python (using presto from pyhive) on my local machine. But I can't figure out the host URL. Does GCP Dataproc even allow accessing Presto clusters remotely?

I tried using the URL from Presto's web UI, but that didn't work either. I also checked the docs about using the Cloud Client Libraries for Python, but they weren't helpful either: https://cloud.google.com/dataproc/docs/tutorials/python-library-example

from pyhive import presto

query = 'SELECT * FROM system.runtime.nodes'

# host must be a bare hostname or IP address, not a full URL
# (passing an "https://..." string here makes urllib3 treat "https" as the hostname)
presto_conn = presto.Connection(host={host}, port=8060, username={user})
presto_cursor = presto_conn.cursor()
presto_cursor.execute(query)

Error

ConnectionError: HTTPConnectionPool(host='https', port=80): Max retries exceeded with url: {url}
(Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb41c0c25d0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))

Update: I was able to manually create a VM on GCP Compute, configure Trino, and set up firewall rules and a load balancer to access the cluster.

Gotta check if Dataproc allows a similar config.

  • What hostname are you using to connect to the presto cluster? – Gaurangi Saxena Sep 13 '21 at 22:03
  • I tried the url on Presto web UI. Docs: https://cloud.google.com/dataproc/docs/concepts/accessing/dataproc-gateways#viewing_and_accessing_component_gateway_urls – Ashiq Korikkar Sep 14 '21 at 11:42
  • Component Gateway goes through Knox which performs url re-write, and there's also the inverting proxy, so I doubt it will serve your purpose. – cyxxy Sep 17 '21 at 22:33
  • Dataproc clusters are nothing but managed GCE instances, so all the GCE related firewall rules, etc. still apply, so what you did with your standalone trino VM on GCP compute (GCE), you should just do the same with the Dataproc cluster. There's no firewall management through Dataproc. – cyxxy Sep 17 '21 at 22:39

1 Answer


Looks like the Google firewall is blocking connections from the outside world.

How to fix

Quick and dirty solution

Just allow access to port 8060 from your IP to the Dataproc cluster.

This might not scale if your public IP address changes, but it will let you develop.
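As a sketch, such a rule can be created with gcloud. The rule name, network, and target tag below are assumptions, not values from your project — adjust them to match your setup:

```shell
# Look up your current public IP (the only address the rule will allow)
MY_IP=$(curl -s https://ifconfig.me)

# Allow TCP 8060 into the cluster from that single IP.
# "default" and "dataproc-cluster" are placeholder network/tag names.
gcloud compute firewall-rules create allow-presto-from-my-ip \
    --network=default \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=tcp:8060 \
    --source-ranges="${MY_IP}/32" \
    --target-tags=dataproc-cluster
```

The `--target-tags` filter only applies if your Dataproc VMs carry that network tag; without it, drop the flag and the rule applies to all instances on the network (broader than you probably want).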

It is a bad idea to expose "big data" services to the whole internet: you might get hacked, and Google may shut the service down.

Use an SSH tunnel

Create a small instance (one from the free tier), expose the SSH port to the internet, and use port forwarding.

Your URLs won't be https://dataproc-cluster:8060..., but https://localhost:forwarded_port

This is easy to do and you can turn off that bastion vm when it's not needed.
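A sketch of the tunnel using gcloud's built-in SSH support; the instance name and zone here are placeholders for your own master node:

```shell
# Forward local port 8060 to the Presto coordinator on the cluster's master node.
# "my-cluster-m" and "us-central1-a" are assumed names - substitute yours.
# -N: no remote command, -L: local port forward
gcloud compute ssh my-cluster-m \
    --zone=us-central1-a \
    -- -N -L 8060:localhost:8060
```

While the tunnel is up, point pyhive at the forwarded port, e.g. `presto.Connection(host='localhost', port=8060, username=user)`.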

Iñigo González