I am experimenting with PySpark. My first task was to write a Parquet file from an RDD, but the job fails with the error "An error occurred while calling o123.parquet.\n: ExitCodeException exitCode=-1073741515:". Here is my code:
import zipfile
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config('spark.driver.memory', '8g') \
    .config('spark.network.timeout', '1200s') \
    .config('spark.executor.memory', '8g') \
    .config("spark.sql.execution.arrow.pyspark.enabled", "false") \
    .appName('Log Transformation') \
    .getOrCreate()
spark.conf.set("spark.default.parallelism", "8")
def read_zipfile(zip_file_path):
    # Stream decoded, stripped lines from the first file inside the zip archive.
    with zipfile.ZipFile(zip_file_path) as z:
        with z.open(z.namelist()[0], 'r') as f:
            for line in f:
                yield line.decode('ansi').strip()
zipfile_path = "DataRef/data.zip"
rdd = spark.sparkContext.parallelize(read_zipfile(zipfile_path))
df = rdd.map(lambda r: [r.split(' ')[0], r.split(' ')[2], r.split(' ')[3]])
# I can't even print this one
print(df.take(5))
data = df.toDF(['datetime', 'component', 'info'])
outRep = "DataRef/"
data.write.mode("overwrite").parquet(outRep+"dparquet.parquet")
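For anyone reproducing this: the index-based projection inside the map can be sanity-checked without Spark on a plain string (the line below is a made-up sample, not my real data; my real log lines are space-separated):

```python
# Hypothetical space-separated log line (not real data), used only to
# verify the index-based projection from the map() above.
r = "2024-01-01T00:00:00 INFO scheduler started"
row = [r.split(' ')[0], r.split(' ')[2], r.split(' ')[3]]
print(row)  # ['2024-01-01T00:00:00', 'scheduler', 'started']
```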
I did some research and found that it could be a PyArrow incompatibility, so I installed PyArrow (version 11.0.0), but the error persists. I am using PySpark 3.0.0 with Python 3.10. Could someone help me with this? Thanks.
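For what it's worth, the zip-reading generator seems fine outside Spark. This self-contained sketch (using an in-memory zip instead of my real file, and 'utf-8' instead of 'ansi' since 'ansi' is a Windows-only codec alias) reads back the expected line:

```python
import io
import zipfile

def read_zipfile(zip_file):
    # Yield decoded, stripped lines from the first member of the zip archive.
    with zipfile.ZipFile(zip_file) as z:
        with z.open(z.namelist()[0], 'r') as f:
            for line in f:
                yield line.decode('utf-8').strip()

# Build a small in-memory zip (fake sample data) to exercise the generator.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as z:
    z.writestr('sample.log', '2024-01-01T00:00:00 INFO scheduler started\n')
buf.seek(0)

lines = list(read_zipfile(buf))
print(lines)
```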