My use case is simple. I have an EMR cluster deployed through CDK, running Presto with the AWS Glue Data Catalog as the metastore. The cluster will have just the default user running queries. By default, the master user is hadoop, which I can use to connect to the cluster via JDBC and run queries. However, I can establish that connection without a password. I have read the Presto docs and they mention LDAP, Kerberos, and file-based authentication. I just want this to behave like, say, a MySQL database, where I have to pass both username AND password to connect. However, for the life of me, I can't find which configuration property the password is set in. These are the settings I have so far:
{
  classification: 'spark-hive-site',
  configurationProperties: {
    'hive.metastore.client.factory.class': 'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory',
  },
},
{
  classification: 'emrfs-site',
  configurationProperties: {
    'fs.s3.maxConnections': '5000',
    'fs.s3.maxRetries': '200',
  },
},
{
  classification: 'presto-connector-hive',
  configurationProperties: {
    'hive.metastore.glue.datacatalog.enabled': 'true',
    'hive.parquet.use-column-names': 'true',
    'hive.max-partitions-per-writers': '7000000',
    'hive.table-statistics-enabled': 'true',
    'hive.metastore.glue.max-connections': '20',
    'hive.metastore.glue.max-error-retries': '10',
    'hive.s3.use-instance-credentials': 'true',
    'hive.s3.max-error-retries': '200',
    'hive.s3.max-client-retries': '100',
    'hive.s3.max-connections': '5000',
  },
},
Which setting can I use to set the hadoop password? Kerberos, LDAP, and file-based authentication all seem overly complicated for this simple use case. Am I missing something obvious?
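For reference, this is roughly what I understand the file-based option would look like as EMR classifications. It is only a sketch: the classification names (especially presto-password-authenticator) are my assumption, mirroring the Trino naming I found later, and the property names come from the password-file docs. It also seems to require enabling TLS on the coordinator first, which is part of why it feels heavyweight for a single default user:

{
  classification: 'presto-config',
  configurationProperties: {
    // password authentication also appears to require HTTPS on the coordinator
    'http-server.authentication.type': 'PASSWORD',
  },
},
{
  classification: 'presto-password-authenticator', // classification name is an assumption
  configurationProperties: {
    'password-authenticator.name': 'file',
    'file.password-file': '/home/hadoop/password.db', // path is illustrative
  },
},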
EDIT: After reading countless pages of documentation and talking to AWS Support, I decided to move to Trino, but I am running into more issues. These are the current configurations in my CDK deployment:
configurations: [
  {
    classification: 'spark-hive-site',
    configurationProperties: {
      'hive.metastore.client.factory.class': 'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory',
    },
  },
  {
    classification: 'emrfs-site',
    configurationProperties: {
      'fs.s3.maxConnections': '5000',
      'fs.s3.maxRetries': '200',
    },
  },
  {
    classification: 'presto-connector-hive',
    configurationProperties: {
      'hive.metastore.glue.datacatalog.enabled': 'true',
      'hive.parquet.use-column-names': 'true',
      'hive.max-partitions-per-writers': '7000000',
      'hive.table-statistics-enabled': 'true',
      'hive.metastore.glue.max-connections': '20',
      'hive.metastore.glue.max-error-retries': '10',
      'hive.s3.use-instance-credentials': 'true',
      'hive.s3.max-error-retries': '200',
      'hive.s3.max-client-retries': '100',
      'hive.s3.max-connections': '5000',
    },
  },
  {
    classification: 'trino-config',
    configurationProperties: {
      'query.max-memory-per-node': `${instanceMemory * 0.15}GB`, // 15% of a node
      'query.max-total-memory-per-node': `${instanceMemory * 0.5}GB`, // 50% of a node
      'query.max-memory': `${instanceMemory * 0.5 * coreInstanceGroupNodeCount}GB`, // 50% of the cluster
      'query.max-total-memory': `${instanceMemory * 0.8 * coreInstanceGroupNodeCount}GB`, // 80% of the cluster
      'query.low-memory-killer.policy': 'none',
      'task.concurrency': vcpuCount.toString(),
      'task.max-worker-threads': (vcpuCount * 4).toString(),
      'http-server.authentication.type': 'PASSWORD',
      'http-server.http.enabled': 'false',
      'internal-communication.shared-secret': 'abcdefghijklnmopqrstuvwxyz',
      'http-server.https.enabled': 'true',
      'http-server.https.port': '8443',
      'http-server.https.keystore.path': '/home/hadoop/fullCert.pem',
    },
  },
  {
    classification: 'trino-password-authenticator',
    configurationProperties: {
      'password-authenticator.name': 'file',
      'file.password-file': '/home/hadoop/password.db',
      'file.refresh-period': '5s',
      'file.auth-token-cache.max-size': '1000',
    },
  },
],
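For concreteness, this is how those template literals in trino-config render for a hypothetical node size (the inputs below are made up for illustration, not my actual cluster):

// Hypothetical inputs, only to show how the trino-config values above render.
const instanceMemory = 64;             // GB of memory per node (assumed)
const vcpuCount = 16;                  // vCPUs per node (assumed)
const coreInstanceGroupNodeCount = 4;  // number of core nodes (assumed)

console.log({
  'query.max-memory-per-node': `${instanceMemory * 0.15}GB`,                         // '9.6GB'
  'query.max-total-memory-per-node': `${instanceMemory * 0.5}GB`,                     // '32GB'
  'query.max-memory': `${instanceMemory * 0.5 * coreInstanceGroupNodeCount}GB`,       // '128GB'
  'query.max-total-memory': `${instanceMemory * 0.8 * coreInstanceGroupNodeCount}GB`, // '204.8GB'
  'task.concurrency': vcpuCount.toString(),                                           // '16'
  'task.max-worker-threads': (vcpuCount * 4).toString(),                              // '64'
});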
I started here: https://trino.io/docs/current/security/tls.html
I am using this approach:
"Secure the Trino server directly. This requires you to obtain a valid certificate, and add it to the Trino coordinator’s configuration."
I have obtained an internal wildcard certificate from my company. This gets me:
- A certificate text
- A certificate chain
- A private key
From here: https://trino.io/docs/current/security/inspect-pem.html
It seems I need to combine those three files into one, laid out like this (a small concatenation sketch follows the layout):
-----BEGIN RSA PRIVATE KEY-----
Content of private key
-----END RSA PRIVATE KEY-----
-----BEGIN CERTIFICATE-----
Content of certificate text
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
First content of chain
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
Second content of chain
-----END CERTIFICATE-----
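A minimal sketch of that concatenation in Node (the input file names are placeholders for whatever the company certificate export produces, not my actual paths):

// Sketch: concatenate private key, leaf certificate, and chain into the one
// PEM file that http-server.https.keystore.path points at.
import { readFileSync, writeFileSync } from 'fs';

const parts = ['privateKey.pem', 'certificate.pem', 'chain.pem'].map(
  (file) => readFileSync(file, 'utf8').trim(),
);

writeFileSync('fullCert.pem', parts.join('\n') + '\n');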
Then, from a bootstrap action, I put the file on all nodes (a sketch of that action follows the config snippet below). That way I can fulfill this: https://trino.io/docs/current/security/tls.html#configure-the-coordinator with these configs:
'http-server.https.enabled': 'true',
'http-server.https.port': '8443',
'http-server.https.keystore.path': '/home/hadoop/fullCert.pem',
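The bootstrap action is roughly the following sketch. The bucket, key names, and script contents are placeholders rather than my actual deployment, and the password.db copied here is built beforehand (the Trino password-file docs describe generating it with htpasswd -B):

// Sketch of a bootstrap action that copies the TLS and password files onto
// every node. All names below are placeholders.
import { CfnCluster } from 'aws-cdk-lib/aws-emr';

const bootstrapActions: CfnCluster.BootstrapActionConfigProperty[] = [
  {
    name: 'copy-trino-security-files',
    scriptBootstrapAction: {
      // copy-trino-files.sh would roughly do:
      //   aws s3 cp s3://my-bucket/fullCert.pem /home/hadoop/fullCert.pem
      //   aws s3 cp s3://my-bucket/password.db  /home/hadoop/password.db
      path: 's3://my-bucket/copy-trino-files.sh',
    },
  },
];

This array is then passed as the bootstrapActions property of the cluster.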
I know for sure the file is deployed to the nodes. Then I proceeded to do this: https://trino.io/docs/current/security/password-file.html
I also know that particular part works, because if I use the Trino CLI directly on the master node with the wrong password, I get a credentials error.
Now, I'm currently stuck doing this:
[hadoop@ip-10-0-10-245 ~]$ trino-cli --server https://localhost:8446 --catalog awsdatacatalog --user hadoop --password --insecure
trino> select 1;
Query 20220701_201620_00001_9nksi failed: Insufficient active worker nodes. Waited 5.00m for at least 1 workers, but only 0 workers are active
From /var/log/trino/server.log
I see:
2022-07-01T21:30:12.966Z WARN http-client-node-manager-51 io.trino.metadata.RemoteNodeState Error fetching node state from https://ip-10-0-10-245.ec2.internal:8446/v1/info/state: Failed communicating with server: https://ip-10-0-10-245.ec2.internal:8446/v1/info/state
2022-07-01T21:30:13.902Z ERROR Announcer-0 io.airlift.discovery.client.Announcer Service announcement failed after 8.11ms. Next request will happen within 1000.00ms
2022-07-01T21:30:14.913Z ERROR Announcer-1 io.airlift.discovery.client.Announcer Service announcement failed after 10.35ms. Next request will happen within 1000.00ms
2022-07-01T21:30:15.921Z ERROR Announcer-3 io.airlift.discovery.client.Announcer Service announcement failed after 8.40ms. Next request will happen within 1000.00ms
2022-07-01T21:30:16.930Z ERROR Announcer-0 io.airlift.discovery.client.Announcer Service announcement failed after 8.59ms. Next request will happen within 1000.00ms
2022-07-01T21:30:17.938Z ERROR Announcer-1 io.airlift.discovery.client.Announcer Service announcement failed after 8.36ms. Next request will happen within 1000.00ms
Also with this:
[hadoop@ip-10-0-10-245 ~]$ trino-cli --server https://localhost:8446 --catalog awsdatacatalog --user hadoop --password
trino> select 1;
Error running command: javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
trino>
Even though I am following this to upload the .pem files as assets to S3:
Am I wrong in saying that something this simple shouldn't be this complicated? I would really appreciate any help here.