My use case is simple. I have an EMR cluster deployed through CDK, running Presto with the AWS Glue Data Catalog as the metastore. The cluster will have just the default user running queries. By default, the master user is hadoop, which I can use to connect to the cluster via JDBC and run queries. However, I can establish that connection without a password. I have read the Presto docs, and they mention LDAP, Kerberos, and file-based authentication. I just want this to behave like, say, a MySQL database, where I have to pass both username AND password to connect. However, for the life of me, I can't find which configuration the password is set on. These are the settings I have so far:

    {
        classification: 'spark-hive-site',
        configurationProperties: {
            'hive.metastore.client.factory.class': 'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory',
        },
    },
    {
        classification: 'emrfs-site',
        configurationProperties: {
            'fs.s3.maxConnections': '5000',
            'fs.s3.maxRetries': '200',
        },
    },
    {
        classification: 'presto-connector-hive',
        configurationProperties: {
            'hive.metastore.glue.datacatalog.enabled': 'true',
            'hive.parquet.use-column-names': 'true',
            'hive.max-partitions-per-writers': '7000000',
            'hive.table-statistics-enabled': 'true',
            'hive.metastore.glue.max-connections': '20',
            'hive.metastore.glue.max-error-retries': '10',
            'hive.s3.use-instance-credentials': 'true',
            'hive.s3.max-error-retries': '200',
            'hive.s3.max-client-retries': '100',
            'hive.s3.max-connections': '5000',
        },
    },

Which setting can I use to set the hadoop password? Kerberos, LDAP, and file-based auth all seem overly complicated for this simple use case. Am I missing something obvious?

EDIT: After reading countless pages of documentation and talking to AWS Support, I decided to move to Trino, but I am running into more issues. These are the current configurations in my CDK deployment:

configurations: [
    {
        classification: 'spark-hive-site',
        configurationProperties: {
            'hive.metastore.client.factory.class': 'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory',
        },
    },
    {
        classification: 'emrfs-site',
        configurationProperties: {
            'fs.s3.maxConnections': '5000',
            'fs.s3.maxRetries': '200',
        },
    },
    {
        classification: 'presto-connector-hive',
        configurationProperties: {
            'hive.metastore.glue.datacatalog.enabled': 'true',
            'hive.parquet.use-column-names': 'true',
            'hive.max-partitions-per-writers': '7000000',
            'hive.table-statistics-enabled': 'true',
            'hive.metastore.glue.max-connections': '20',
            'hive.metastore.glue.max-error-retries': '10',
            'hive.s3.use-instance-credentials': 'true',
            'hive.s3.max-error-retries': '200',
            'hive.s3.max-client-retries': '100',
            'hive.s3.max-connections': '5000',
        },
    },
    {
        classification: 'trino-config',
        configurationProperties: {
            'query.max-memory-per-node': `${instanceMemory * 0.15}GB`, // 15% of a node
            'query.max-total-memory-per-node': `${instanceMemory * 0.5}GB`, // 50% of a node
            'query.max-memory': `${instanceMemory * 0.5 * coreInstanceGroupNodeCount}GB`, // 50% of the cluster
            'query.max-total-memory': `${instanceMemory * 0.8 * coreInstanceGroupNodeCount}GB`, // 80% of the cluster
            'query.low-memory-killer.policy': 'none',
            'task.concurrency': vcpuCount.toString(),
            'task.max-worker-threads': (vcpuCount * 4).toString(),
            'http-server.authentication.type': 'PASSWORD',
            'http-server.http.enabled': 'false',
            'internal-communication.shared-secret': 'abcdefghijklnmopqrstuvwxyz',
            'http-server.https.enabled': 'true',
            'http-server.https.port': '8443',
            'http-server.https.keystore.path': '/home/hadoop/fullCert.pem',
        },
    },
    {
        classification: 'trino-password-authenticator',
        configurationProperties: {
            'password-authenticator.name': 'file',
            'file.password-file': '/home/hadoop/password.db',
            'file.refresh-period': '5s',
            'file.auth-token-cache.max-size': '1000',
        },
    },
],
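As an aside, the internal-communication.shared-secret above is a throwaway placeholder; the Trino docs recommend a long random value (e.g. generated with openssl rand 512 | base64). Here is a minimal sketch of producing one at synth time with Node's built-in crypto module (the variable name is illustrative, not from my actual stack):

    import { randomBytes } from 'crypto';

    // 512 random bytes, base64-encoded -- comparable to the
    // `openssl rand 512 | base64` suggestion in the Trino docs.
    const sharedSecret = randomBytes(512).toString('base64');

    // ...then reference it in the classification instead of a literal:
    // 'internal-communication.shared-secret': sharedSecret,

Keep in mind that a secret generated at synth time changes on every deployment; pulling it from Secrets Manager or SSM Parameter Store would be more stable, but that is beyond this sketch.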

I started here: https://trino.io/docs/current/security/tls.html

I am using this approach:

"Secure the Trino server directly. This requires you to obtain a valid certificate, and add it to the Trino coordinator’s configuration."

I have obtained an internal wildcard certificate from my company. This gets me:

  • A certificate text
  • A certificate chain
  • A private key

From here: https://trino.io/docs/current/security/inspect-pem.html

It seems I need to combine those 3 files into one, which I do as follows:

-----BEGIN RSA PRIVATE KEY-----
Content of private key
-----END RSA PRIVATE KEY-----
-----BEGIN CERTIFICATE-----
Content of certificate text
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
First content of chain
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
Second content of chain
-----END CERTIFICATE-----
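
For reference, a minimal sketch of producing that combined file with Node's fs module (the three input file names are hypothetical; the concatenation order matches the layout above):

    import { readFileSync, writeFileSync } from 'fs';

    // Hypothetical input names: private key first, then the leaf
    // certificate, then the chain, matching the layout shown above.
    const parts = ['privateKey.pem', 'certificate.pem', 'chain.pem'];
    const combined = parts.map((p) => readFileSync(p, 'utf8').trim()).join('\n') + '\n';

    writeFileSync('fullCert.pem', combined);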

Then, from a bootstrap action, I put the file on all nodes. That way I can fulfill this: https://trino.io/docs/current/security/tls.html#configure-the-coordinator with these configs:

'http-server.https.enabled': 'true',
'http-server.https.port': '8443',
'http-server.https.keystore.path': '/home/hadoop/fullCert.pem',
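
For completeness, the bootstrap action that ships the file to every node looks roughly like this in my CDK stack (the bucket and script names below are placeholders, not the real ones):

    // Sketch only: the script itself would do something along the lines of
    //   aws s3 cp s3://my-assets-bucket/fullCert.pem /home/hadoop/fullCert.pem
    bootstrapActions: [
        {
            name: 'deploy-tls-cert',
            scriptBootstrapAction: {
                path: 's3://my-assets-bucket/deploy-cert.sh',
            },
        },
    ],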

I know for sure the file is deployed to the nodes. Then I proceeded to do this: https://trino.io/docs/current/security/password-file.html

I also know that particular part works, because if I use the Trino CLI directly on the master node with the wrong password, I get a credentials error.

Now, I'm stuck at this:

[hadoop@ip-10-0-10-245 ~]$ trino-cli --server https://localhost:8446 --catalog awsdatacatalog --user hadoop --password --insecure
trino> select 1;
Query 20220701_201620_00001_9nksi failed: Insufficient active worker nodes. Waited 5.00m for at least 1 workers, but only 0 workers are active

From /var/log/trino/server.log I see:

2022-07-01T21:30:12.966Z        WARN    http-client-node-manager-51     io.trino.metadata.RemoteNodeState       Error fetching node state from https://ip-10-0-10-245.ec2.internal:8446/v1/info/state: Failed communicating with server: https://ip-10-0-10-245.ec2.internal:8446/v1/info/state
2022-07-01T21:30:13.902Z        ERROR   Announcer-0     io.airlift.discovery.client.Announcer   Service announcement failed after 8.11ms. Next request will happen within 1000.00ms
2022-07-01T21:30:14.913Z        ERROR   Announcer-1     io.airlift.discovery.client.Announcer   Service announcement failed after 10.35ms. Next request will happen within 1000.00ms
2022-07-01T21:30:15.921Z        ERROR   Announcer-3     io.airlift.discovery.client.Announcer   Service announcement failed after 8.40ms. Next request will happen within 1000.00ms
2022-07-01T21:30:16.930Z        ERROR   Announcer-0     io.airlift.discovery.client.Announcer   Service announcement failed after 8.59ms. Next request will happen within 1000.00ms
2022-07-01T21:30:17.938Z        ERROR   Announcer-1     io.airlift.discovery.client.Announcer   Service announcement failed after 8.36ms. Next request will happen within 1000.00ms

Also with this:

[hadoop@ip-10-0-10-245 ~]$ trino-cli --server https://localhost:8446 --catalog awsdatacatalog --user hadoop --password
trino> select 1;
Error running command: javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
trino> 

Even though I am following this to upload the .pem files as assets to S3:

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-encryption-enable.html#emr-encryption-certificates

Am I wrong in saying that something this simple shouldn't be this complicated? I would really appreciate any help here.

rodrigocf

1 Answer


Based on the message you are getting from Trino, Insufficient active worker nodes, the authentication system is working, and you are now having problems with secure internal communication. Specifically, the machines are having problems talking to each other. I would start by disabling internal TLS, verifying that everything is working, and only then work on enabling it (assuming you need it in your environment). To disable internal TLS, use:

internal-communication.shared-secret=<secret>
internal-communication.https.required=false
discovery.uri=http://<coordinator ip address>:<http port>

Then restart all your machines. You should no longer see Service announcement failed. There might be a couple of these while the machines are starting up, but once they establish communication, the error messages should stop.
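
In the question's CDK terms, that would look something like the following sketch, merged into the existing trino-config classification (the address and port are placeholders you have to resolve for your own cluster):

    {
        classification: 'trino-config',
        configurationProperties: {
            // Plain HTTP must be available for internal traffic, so the
            // question's 'http-server.http.enabled': 'false' has to go
            // back to 'true' while testing this.
            'http-server.http.enabled': 'true',
            'internal-communication.shared-secret': '<secret>',
            'internal-communication.https.required': 'false',
            // Placeholders, as in the properties above.
            'discovery.uri': 'http://<coordinator ip address>:<http port>',
        },
    },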

Dain Sundstrom
  • I have a question about this: on EMR, doesn't `discovery.uri` become a chicken-or-the-egg problem? Like, I can't really know the coordinator IP until it is deployed, right? Also, and I'm revealing how new I am to this here, what is the default value of the HTTP port if I don't have it set? Is it 8446, as per the logs I have above? – rodrigocf Jul 06 '22 at 21:28