I am trying to read multiple Parquet files from an S3 bucket containing several days' worth of data.
S3 path: s3n://<s3path>/dt=*/*.snappy.parquet
PySpark code to read the data from multiple Parquet files:
s="SELECT * FROM parquet.`s3n://<s3path>/dt=*/*.snappy.parquet`"
c = sqlContext.sql(s)
c.show(2)
The error is:
[Stage 3:> (0 + 3) /
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary
at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
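For reference, the same multi-partition glob can also be expressed through the DataFrameReader API; this is a minimal sketch using the same placeholder path and the same sqlContext as above:

# Same glob over all dt= partitions, read through the DataFrameReader API
df = sqlContext.read.parquet("s3n://<s3path>/dt=*/*.snappy.parquet")
df.show(2)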
Working code that reads the files from a single day's partition in the same S3 bucket:
s2="SELECT * FROM parquet.`s3n://<s3path>/dt=2016-02-02/*.snappy.parquet`"
c2 = sqlContext.sql(s2)
c2.show(2)
+------+--------+---------------+-------------+-----------+
|   CL1|     CL2|            CL3|          CL3|       r_CD|
+------+--------+---------------+-------------+-----------+
|    18|   YGh4c|     2016-02-02|     00:32:02|         AC|
|    18|   YGh4c|     2016-02-02|     00:32:02|         IC|
+------+--------+---------------+-------------+-----------+
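For reference, a sketch of how the schemas reported by two different partitions could be compared side by side (the second date below is an assumed example, not a partition I have singled out):

# Read two dt= partitions separately and print the schema each one reports
# (the dates here are example values)
a = sqlContext.read.parquet("s3n://<s3path>/dt=2016-02-02/*.snappy.parquet")
b = sqlContext.read.parquet("s3n://<s3path>/dt=2016-02-03/*.snappy.parquet")
a.printSchema()
b.printSchema()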
I am able to read the files in the bucket individually using the same PySpark command. Why does scanning the whole bucket give an error? What is the correct way to do this?