I am new to EMR and Bigdata,
We have an EMR step and that was working fine till last month, currently I am getting the below error.
--- Logging error ---
Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/src.zip/src/source/Data_Extraction.py", line 59, in process_job_description
df_job_desc = spark.read.schema(schema_jd).option('multiline',"true").json(self.filepath)
File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 274, in json
return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o115.json.
: java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Remote host terminated the handshake
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.list(Jets3tNativeFileSystemStore.java:421)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:654)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:625)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.listStatus(EmrFileSystem.java:473)
at
these json files are presents in S3, I downloaded some of the files to reproduce the issue in local, when I have smaller set of data, it is working fine, but in EMR im unable to reproduce.
also, I checked Application details of EMR for this step.
it says undefined
status for status with the below details.
Details:org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3285)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:282)
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
py4j.commands.CallCommand.execute(CallCommand.java:79)
py4j.GatewayConnection.run(GatewayConnection.java:238)
java.lang.Thread.run(Thread.java:750)
spark session creation
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
spark_builder = (
SparkSession\
.builder\
.config(conf=SparkConf())\
.appName("test"))
spark = spark_builder.getOrCreate()
I am not sure, what went wrong suddenly with this step, please help.