
I have referred to the following links to understand how to export a Spark SQL DataFrame in Python.

My code:

df = sqlContext.createDataFrame(routeRDD, ['Consigner', 'AverageScore', 'Trips'])
df.select('Consigner', 'AverageScore', 'Trips').write.format('com.databricks.spark.csv').options(header='true').save('file:///opt/BIG-DATA/VisualCargo/output/top_consigner.csv')

I submit the job with spark-submit, passing the following jars on the master URL:

spark-csv_2.11-1.5.0.jar, commons-csv-1.4.jar

I am getting the following error:

df.select('Consigner', 'AverageScore', 'Trips').write.format('com.databricks.spark.csv').options(header='true').save('file:///opt/BIG-DATA/VisualCargo/output/top_consigner.csv')
      File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 332, in save
      File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
      File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 36, in deco
      File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling o156.save.
    : java.lang.NoSuchMethodError: scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
        at com.databricks.spark.csv.util.CompressionCodecs$.<init>(CompressionCodecs.scala:29)
        at com.databricks.spark.csv.util.CompressionCodecs$.<clinit>(CompressionCodecs.scala)
        at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:198)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:170)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)
Hardik Gupta

2 Answers


Spark version 1.5.0-cdh5.5.1 is built with Scala 2.10, the default Scala version for Spark < 2.0. Your spark-csv is built with Scala 2.11 (spark-csv_2.11-1.5.0.jar), hence the NoSuchMethodError.

Please switch to a spark-csv build for Scala 2.10, or update to a Spark build that uses Scala 2.11. You can tell the Scala version from the suffix after the artifactId, i.e. spark-csv_2.10-1.5.0 is built for Scala 2.10.
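As a quick sanity check before submitting, the Scala binary version can be read straight off a jar's artifact name. A minimal sketch (the `scala_version` helper is hypothetical, not part of any Spark API):

```python
import re

def scala_version(artifact_name):
    """Extract the Scala binary version from a Maven-style artifact
    name such as 'spark-csv_2.11-1.5.0.jar' (hypothetical helper)."""
    match = re.search(r"_(\d+\.\d+)-", artifact_name)
    return match.group(1) if match else None

# The jar from the question is a Scala 2.11 build, which clashes
# with a Scala 2.10 Spark:
# scala_version("spark-csv_2.11-1.5.0.jar") -> "2.11"
```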

T. Gawęda
  • spark version: version 1.5.0-cdh5.5.1 – Hardik Gupta Dec 01 '16 at 10:01
  • @Hardik Yes, so it's a Scala conflict. Please update (downgrade) spark-csv to the 2.10 build - http://search.maven.org/#artifactdetails%7Ccom.databricks%7Cspark-csv_2.11%7C1.5.0%7Cjar – T. Gawęda Dec 01 '16 at 10:03
  • Sir, thank you so much, downgrading my spark-csv jar to 2.10 works. However, it now creates multiple partitions in the output folder. Is there a way to control this? I tried `write.repartition(1).format("com.databricks.spark.csv")...` but it throws an error – Hardik Gupta Dec 01 '16 at 10:06
  • gives `AttributeError: 'DataFrameWriter' object has no attribute 'repartition'` – Hardik Gupta Dec 01 '16 at 10:08
  • `df.write.repartition(1).format("com.databricks.spark.csv").option("header", "true").save("file.csv")` , throws error `AttributeError: 'DataFrameWriter' object has no attribute 'repartition'` – Hardik Gupta Dec 01 '16 at 10:10
  • 1
    @Hardik order must be different - first `repartition` and then `write`, as I wrote in previous comment – T. Gawęda Dec 01 '16 at 10:11
  • Just one more question: why does it create a folder with part files, and not just one file? – Hardik Gupta Dec 01 '16 at 10:16
  • 1
    @Hardik That's how HadoopFileFormat works, Spark uses it to write files. In the folder there will be 1 file per each partition – T. Gawęda Dec 01 '16 at 10:19
  • 1
    @Hardik If you write it to normal storage, then you can use standard Java File API or `mv` command. In HDFS you can use `hdfs dfs -mv` or Hadoop File API – T. Gawęda Dec 01 '16 at 10:33

I am running Spark on Windows and faced a similar issue of not being able to write to a file (CSV or Parquet). After reading more on the Spark website, I found that the error below was caused by the winutils version I was using. I changed it to the 64-bit version and it worked. Hope this helps someone.

Kranti