
I'm reading a CSV file and turning it into Parquet:

Reading the CSV:

variable = spark.read.csv(r'C:\Users\xxxxx.xxxx\Desktop\archive\test.csv', sep=';', inferSchema=True, header=True)

Writing to Parquet:

variable.write.parquet(
    path=r'C:\Users\xxxxx.xxxx\Desktop\archive\parquet\new.parquet',
    # or: r'C:\Users\xxxxx.xxxx\Desktop\archive\parquet'
    mode='overwrite',
)
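
For reference, `variable` comes from a SparkSession named `spark`. A minimal sketch of how that session is typically created locally; this setup is an assumption, since it is not shown in the original post:

from pyspark.sql import SparkSession

# Hypothetical local session setup; the original post does not show it.
spark = (
    SparkSession.builder
    .master('local[*]')
    .appName('csv-to-parquet')
    .getOrCreate()
)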

Both give the same error:

Py4JJavaError: An error occurred while calling o186.parquet.
: org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:651)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:288)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98
Job aborted due to stage failure: 
Task 1 in stage 14.0 failed 1 times, most recent failure: Lost task 1.0 in stage 14.0 (TID 63) 
(XXXXX-xxxx.xxx.local executor driver): java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\xxxx.xxxx\Desktop\xxx.parquet\_temporary\0\_temporary\attempt_202304111306381850890757855117295_0014_m_000001_63\part-00001-1ea07aa8-0302-492c-993c-86ce32f575d8-c000.snappy.parquet

In Google Colab the same code works perfectly, without any changes.

I just want to know why it doesn't work on my Windows 10 machine, and what I can do to fix it.

    One segment of your path has two slashes while others do not? Also, Spark will always write a directory, not a single file – OneCricketeer Apr 11 '23 at 13:08
  • Which version of Java are you using? 32-bit or 64-bit? – Abdennacer Lachiheb Apr 11 '23 at 14:26
  • @AbdennacerLachiheb, 64-bit. – Guilherme Apr 11 '23 at 14:45
  • @OneCricketeer I accidentally typed the double slash in the question; read it as a single \ – Guilherme Apr 11 '23 at 14:47
  • Also note that in Colab it worked perfectly – Guilherme Apr 11 '23 at 14:48
  • Which version of Java are you using? Try Java 8 – Abdennacer Lachiheb Apr 11 '23 at 14:53
  • @AbdennacerLachiheb yes, os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jre1.8.0_361" – Guilherme Apr 11 '23 at 14:56
  • "Job aborted" is not your actual error. Please find the logs of the failed executor in Spark UI. Java/OS versions should not matter until you see exact error messages related to them (Besides, Colab uses Linux. On Windows, you need `winutils.exe`, `hadoop.dll` files, for example, which are not included with Spark) – OneCricketeer Apr 11 '23 at 15:37
  • @OneCricketeer Yes, I downloaded winutils.exe from GitHub, and I'm using Hadoop 2.7, but it still doesn't run on Windows. – Guilherme Apr 11 '23 at 16:02
  • @OneCricketeer, the error in the Spark log: Job aborted due to stage failure: Task 1 in stage 14.0 failed 1 times, most recent failure: Lost task 1.0 in stage 14.0 (TID 63) (XXXXX-xxxx.xxx.local executor driver): java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\xxxx.xxxx\Desktop\xxx.parquet\_temporary\0\_temporary\attempt_202304111306381850890757855117295_0014_m_000001_63\part-00001-1ea07aa8-0302-492c-993c-86ce32f575d8-c000.snappy.parquet – Guilherme Apr 11 '23 at 16:07
  • Looks like it tried to `chmod 0644` a file (using winutils), but then something was null in that command? But it did create a parquet file. The latest Pyspark uses Hadoop 3.x libraries, which will not work with Hadoop 2.7 resources, by the way. – OneCricketeer Apr 11 '23 at 23:00
  • Does this answer your question? [(null) entry in command string exception in saveAsTextFile() on Pyspark](https://stackoverflow.com/questions/40764807/null-entry-in-command-string-exception-in-saveastextfile-on-pyspark) – samkart Apr 12 '23 at 07:58
  • @OneCricketeer even after changing the versions as indicated, the same error persists: Py4JJavaError: An error occurred while calling o45.parquet. : org.apache.spark.SparkException: Job aborted. "With Windows it's quite complicated" – Guilherme Apr 12 '23 at 12:28
  • @samkart I followed this process previously, and it didn't work. I don't know if there's a solution hahahaha – Guilherme Apr 12 '23 at 12:30
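
Following up on the winutils.exe / hadoop.dll comments above, a minimal sketch of the Windows environment setup being suggested. The C:\hadoop path is an assumption (not from the thread); the winutils.exe / hadoop.dll build must match the Hadoop version bundled with your PySpark (3.x for recent releases, per the comment above), and these variables must be set before the SparkSession is created:

import os

# Assumed download location for winutils.exe and hadoop.dll (C:\hadoop\bin is hypothetical).
os.environ['HADOOP_HOME'] = r'C:\hadoop'
# hadoop.dll must be discoverable by the JVM, so prepend the bin folder to PATH.
os.environ['PATH'] = r'C:\hadoop\bin' + os.pathsep + os.environ['PATH']
# JAVA_HOME as already set in the comments above.
os.environ['JAVA_HOME'] = r'C:\Program Files\Java\jre1.8.0_361'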

0 Answers