
I am experimenting with PySpark. My first task is to write a Parquet file from an RDD, but the write fails with the error "An error occurred while calling o123.parquet.\n: ExitCodeException exitCode=-1073741515:". Here is my code:

import zipfile

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config('spark.driver.memory', '8g') \
    .config('spark.network.timeout', '1200s') \
    .config('spark.executor.memory', '8g') \
    .config("spark.sql.execution.arrow.pyspark.enabled", "false") \
    .appName('Log Transformation ') \
    .getOrCreate()

spark.conf.set("spark.default.parallelism", "8")

        
def read_zipfile(zip_file_path):
    with zipfile.ZipFile(zip_file_path) as z:
        with z.open(z.namelist()[0], 'r') as f:
            for line in f:
                yield line.decode('ansi').strip()

zipfile_path = "DataRef/data.zip"

rdd = spark.sparkContext.parallelize(read_zipfile(zipfile_path))

df = rdd.map(lambda r: (r.split(' ')[0], r.split(' ')[2], r.split(' ')[3]))

# I can't even print this one
print(df.take(5))
data = df.toDF(['datetime', 'component', 'info'])

outRep = "DataRef/"

data.write.mode("overwrite").parquet(outRep+"dparquet.parquet")
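For what it's worth, the row transformation itself can be checked without Spark at all. This is a minimal sketch of what my `map()` is meant to do, assuming the log lines are space-separated and the fields I want are at positions 0, 2 and 3 (the sample line is made up):

    # Sketch of the intended per-line transformation, no Spark needed.
    # Splits once instead of calling split() three times per line.
    def to_row(line):
        parts = line.split(' ')
        return (parts[0], parts[2], parts[3])

    sample = "2023-02-26T10:00:00 INFO scheduler task-started"
    print(to_row(sample))  # ('2023-02-26T10:00:00', 'scheduler', 'task-started')

This prints fine locally, so the failure seems to happen only once Spark tries to write the Parquet file.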

I did some research and found that it could be a PyArrow incompatibility, but I already have PyArrow installed (version 11.0.0). I am using PySpark 3.0.0 and Python 3.10. Is there someone who could help me with this? Thanks

  • Hello, have you looked at https://stackoverflow.com/questions/45947375/why-does-starting-a-streaming-query-lead-to-exitcodeexception-exitcode-1073741 ? – ggagliano Feb 26 '23 at 11:42

0 Answers