
I am new to EMR and Big Data.

We have an EMR step that was working fine until last month; currently I am getting the error below.

--- Logging error ---
Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/src.zip/src/source/Data_Extraction.py", line 59, in process_job_description
    df_job_desc = spark.read.schema(schema_jd).option('multiline',"true").json(self.filepath)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 274, in json
    return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o115.json.
: java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Remote host terminated the handshake
    at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.list(Jets3tNativeFileSystemStore.java:421)
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:654)
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:625)
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.listStatus(EmrFileSystem.java:473)
    at 

These JSON files are present in S3. I downloaded some of the files to try to reproduce the issue locally; with a smaller set of data everything works fine, so I am unable to reproduce the EMR failure.

I also checked the EMR application details for this step.

The step status shows as undefined, with the details below:

Details:org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3285)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:282)
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
py4j.commands.CallCommand.execute(CallCommand.java:79)
py4j.GatewayConnection.run(GatewayConnection.java:238)
java.lang.Thread.run(Thread.java:750)

Spark session creation:

from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

spark_builder = (
    SparkSession
    .builder
    .config(conf=SparkConf())
    .appName("test")
)
spark = spark_builder.getOrCreate()

I am not sure what suddenly went wrong with this step; please help.

Pyd
  • Based on the errors, the problem is your connectivity with S3; an example for an S3 read can be found here - https://stackoverflow.com/questions/69734268/java-io-ioexception-no-filesystem-for-scheme-s3 – Vaebhav Aug 18 '22 at 05:29
  • but my other steps, which also read JSON from S3, are working fine; I am using the same procedure – Pyd Aug 18 '22 at 05:31
  • Your AWS token might have expired – Vaebhav Aug 18 '22 at 05:32
  • thanks, but we are not using tokens. We create the Spark session via the Spark context, and EMR has permission to read S3, so it reads automatically – Pyd Aug 18 '22 at 07:16
  • Can you update the question with how you are establishing the S3 connectivity? – Vaebhav Aug 18 '22 at 07:59
  • I updated the question, can you check? – Pyd Aug 18 '22 at 15:36
  • While I haven't run across this particular issue, I do have EMR jobs that ran fine for years until "serverless EMR" was released. We've seen a general degradation of EMR, with tons of inexplicable failures, since that release. We're abandoning EMR and going with our own Kubernetes scripts as a result. So far it's faster, cheaper, and far more reliable. – Tim Gautier Aug 18 '22 at 17:52

2 Answers


Your error indicates a failed security protocol; various results from googling all point to S3 throttling/rejecting incoming TLS connections, which is best handled with a backoff strategy.

You can try retrying with an exponential backoff strategy and limiting your request rate; a sketch follows.
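If the handshake failures are throttling-related, one option is to raise the built-in retry budget when building the session. This is a minimal sketch, assuming the EMRFS properties fs.s3.maxRetries and fs.s3.sleepTimeSeconds (the ones AWS documents for S3 throttling errors on EMR) apply to your release; the values are illustrative and should be tuned:

from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .appName("test")
    # let EMRFS/the AWS SDK retry failed S3 requests more times
    .config("spark.hadoop.fs.s3.maxRetries", "20")
    # and wait longer between consecutive retries
    .config("spark.hadoop.fs.s3.sleepTimeSeconds", "10")
    .getOrCreate()
)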

Additionally, you can check your DNS quotas to verify that nothing is being rate-limited or that you are not exhausting your quota.

Further, add your application environment details to the question so we can check whether an outdated version might be causing this (a quick way to gather most of these is sketched after the list):

  • EMR Release version
  • Spark Versions
  • AWS SDK Version
  • AMI [Amazon Linux Machine Image] versions
  • Java & JVM Details
  • Hadoop Details
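A sketch for pulling several of these from a running PySpark session (the JVM accessors below are standard Spark/Hadoop APIs; the EMR release label itself is easiest to read from the EMR console):

# Collect version details from the running session
print("Spark :", spark.version)
print("Java  :", spark.sparkContext._jvm.java.lang.System.getProperty("java.version"))
print("Hadoop:", spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())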

A recommended environment would be AMI 2.x, EMR 5.3x, and the compatible SDKs for the same [preferably AWSS3JavaClient 1.11x].

More info about EMR releases can be found here.

Additionally, provide a clear snippet of how exactly you are reading your JSON files from S3: are you reading them iteratively, one after the other, or in bulk/batches? The sketch below illustrates the difference.
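For illustration, the difference looks roughly like this (the bucket path and `file_paths` are hypothetical):

# Bulk: one read over a glob; a single Spark job, far fewer S3 list/get calls
df_bulk = (
    spark.read.schema(schema_jd)
    .option("multiline", "true")
    .json("s3://your-bucket/jobs/*.json")
)

# Iterative: one read per file; many more S3 requests, easier to throttle
for path in file_paths:
    df_single = (
        spark.read.schema(schema_jd)
        .option("multiline", "true")
        .json(path)
    )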


Vaebhav

From your error message, ...SdkClientException: Unable to execute HTTP request: Remote host terminated the handshake, it seems you have a security protocol that is not accepted by the host, or the connection was closed on the service side before the SDK was able to complete the handshake. You should add a try/except block with some delay between retries to handle those errors:

import time

errors = 0
df_job_desc = None
while errors < 5:
    try:
        df_job_desc = spark.read.schema(schema_jd).option('multiline', "true").json(self.filepath)
        break  # success: stop retrying, otherwise the loop never ends
    except Exception:
        errors += 1
        time.sleep(2 ** errors)  # back off a little longer on each retry

nferreira78
  • Can you provide an example? – Pyd Aug 22 '22 at 17:55
  • just edited to add an example – nferreira78 Aug 23 '22 at 09:26
  • the variable `read` was meant to change once the read operation succeeded (hence the `while not read` in an earlier revision). However, this is an example; you need to apply it and test it on your own setup. – nferreira78 Aug 23 '22 at 09:41
  • yes, I applied it to my solution; it's a never-ending loop, every time the same error is thrown. The step was running for more than 5 hours, so I stopped it – Pyd Aug 23 '22 at 15:58
  • like I said, it was only an example, but I have edited it to add a counter and retry a maximum of 5 times before exiting the while loop. You may also increase the delay; you may be blocked for making too many requests – nferreira78 Aug 24 '22 at 13:23