Questions tagged [spark3]

To be used for Apache Spark 3.x

This tag is for all questions related to Apache Spark 3.0.0 and higher.

This tag is kept separate from the apache-spark tag because Spark 3.x introduced breaking changes.

Apache Spark is a unified analytics engine for large-scale data processing.

80 questions
30 votes, 6 answers

to_date fails to parse date in Spark 3.0

I am trying to parse a date using to_date(), but I get the following exception. SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '12/1/2010 8:26' in the new parser. You can set…
noobie-php
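
The two fixes that usually come up for this: give to_date/to_timestamp a pattern the new Java 8 time parser accepts, or restore the Spark 2.x parser behaviour. A minimal sketch:

```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("12/1/2010 8:26",)], ["ts"])

# Option 1: give the new Java 8 time parser an exact pattern.
df.select(F.to_timestamp("ts", "M/d/yyyy H:mm").alias("parsed")).show()

# Option 2: restore the Spark 2.x parser behaviour session-wide.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
```

Option 1 is the forward-looking choice; the LEGACY policy is a blanket switch that affects all datetime parsing in the session.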
9 votes, 1 answer

Spark 3.0 is much slower to read json files than Spark 2.4

I have a large number of JSON files that Spark 2.4 can read in 36 seconds, but Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, it looks like Spark 3.0 is choosing a different DAG than Spark 2.4. Does anyone have any idea what is going…
smishra
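
One commonly suggested mitigation is to pass an explicit schema so Spark 3 skips the schema-inference pass over every file; the schema and path below are assumptions for illustration:

```
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema; replace with the real one for your files.
schema = StructType([
    StructField("id", LongType()),
    StructField("payload", StringType()),
])

# Supplying the schema up front avoids the inference pass over all files.
df = spark.read.schema(schema).json("/data/events/*.json")
```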
7 votes, 1 answer

Spark 3.0.1 tasks are failing when using zstd compression codec

I'm using Spark 3.0.1 with user-provided Hadoop 3.2.0 and Scala 2.12.10, running on Kubernetes. Everything works fine when reading a parquet file compressed as snappy; however, when I try to read a parquet file compressed as zstd, several tasks fail…
phzz
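
A hedged way to narrow this down, since zstd decompression in this setup depends on native zstd support in the user-provided Hadoop: write one small file per codec and read it back; if only the zstd read fails, check `hadoop checknative -a` on the executor image. Sketch:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Write a small file with each codec to separate the codec from the data.
df = spark.range(1000)
df.write.option("compression", "snappy").parquet("/tmp/t_snappy")
df.write.option("compression", "zstd").parquet("/tmp/t_zstd")

# If this read fails while the snappy one works, check native support on
# every executor image:  hadoop checknative -a  (look for 'zstd: true').
spark.read.parquet("/tmp/t_zstd").count()
```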
5 votes, 1 answer

How to solve the following issue in Spark 3.0? Can not create the managed table. The associated location already exists.

In my Spark job, I tried to overwrite a table in each micro-batch of Structured Streaming: batchDF.write.mode(SaveMode.Overwrite).saveAsTable("mytable"). It generated the following error: Can not create the managed table('`mytable`'). The associated…
yyuankm
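
One workaround often suggested for this error is to drop the stale table and clear its old location inside foreachBatch before the overwrite. The warehouse path and table name below are assumptions, and the filesystem call goes through Spark's internal JVM gateway:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def overwrite_table(batch_df, batch_id):
    # Drop stale metadata first: the error usually means the metastore entry
    # is gone but the old files still sit under the warehouse directory.
    spark.sql("DROP TABLE IF EXISTS mytable")
    # If files linger from an earlier failed write, clear the location too
    # (the path is an assumption -- check spark.sql.warehouse.dir).
    path = spark._jvm.org.apache.hadoop.fs.Path("/user/hive/warehouse/mytable")
    fs = path.getFileSystem(spark.sparkContext._jsc.hadoopConfiguration())
    fs.delete(path, True)
    batch_df.write.mode("overwrite").saveAsTable("mytable")

# query = (df.writeStream.foreachBatch(overwrite_table)
#            .option("checkpointLocation", "/tmp/ckpt").start())
```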
4 votes, 3 answers

Convert date to ISO week date in Spark

Having dates in one column, how do I create a column containing the ISO week date? The ISO week date is composed of a year, a week number and a weekday. The year is not the same as the year obtained using the year function. The week number is the easy part - it can be…
ZygD
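
A sketch of one way to assemble the ISO week date from built-ins, using the rule that the ISO year is the calendar year of that week's Thursday:

```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2010-01-01",)], ["d"]) \
          .select(F.col("d").cast("date").alias("d"))

# ISO weekday: Monday=1 .. Sunday=7 (dayofweek is Sunday=1 .. Saturday=7).
df = df.withColumn("iso_wd", (F.dayofweek("d") + 5) % 7 + 1)

# The ISO year is the calendar year of that week's Thursday.
df = df.withColumn("iso_year",
                   F.expr("year(date_add(d, 4 - ((dayofweek(d) + 5) % 7 + 1)))"))

# 2010-01-01 is a Friday in ISO week 53 of 2009, so this prints 2009-W53-5.
df.select(F.format_string("%d-W%02d-%d", "iso_year",
                          F.weekofyear("d"), "iso_wd")
           .alias("iso_week_date")).show()
```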
4 votes, 0 answers

Spark Adaptive Query Execution not working as expected

I've tried to use Spark AQE to dynamically coalesce shuffle partitions before writing. By default, Spark creates too many small files. However, the AQE feature claims that enabling it will optimize this and merge small files into bigger…
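
A hedged config sketch for turning coalescing on; note AQE only merges partitions that come out of a shuffle, so the plan needs an exchange before the write. The numbers are placeholders:

```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.adaptive.enabled", "true")
         .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
         # Upper bound for the first shuffle; AQE coalesces down from here.
         .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "1000")
         # Target size AQE aims for when merging small partitions.
         .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")
         .getOrCreate())
```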
4 votes, 1 answer

Spark 3.0 streaming metrics in Prometheus

I'm running a Spark 3.0 application (Spark Structured Streaming) on Kubernetes and I'm trying to use the new native Prometheus metric sink. I'm able to make it work and get all the metrics described here. However, the metrics I really need are the…
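
A sketch of the native Spark 3 pieces usually combined for this, set here through the session builder; note that Structured Streaming metrics are off by default:

```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Exposes driver metrics at <driver>:4040/metrics/prometheus.
         .config("spark.ui.prometheus.enabled", "true")
         # PrometheusServlet sink (the metrics.properties equivalent).
         .config("spark.metrics.conf.*.sink.prometheusServlet.class",
                 "org.apache.spark.metrics.sink.PrometheusServlet")
         .config("spark.metrics.conf.*.sink.prometheusServlet.path",
                 "/metrics/prometheus")
         # Structured Streaming query metrics are off by default.
         .config("spark.sql.streaming.metricsEnabled", "true")
         .getOrCreate())
```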
3 votes, 1 answer

Adaptive Query Execution and Shuffle Partitions

With Adaptive Query Execution in Spark 3+, can we say that we don't need to set spark.sql.shuffle.partitions explicitly at different stages in the application? Given that we have set spark.sql.adaptive.coalescePartitions.initialPartitionNum As…
Abhishek
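
A sketch of how the knobs relate, with placeholder values: initialPartitionNum acts as the starting width before AQE coalesces, and spark.sql.shuffle.partitions remains the fallback when it is unset:

```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.adaptive.enabled", "true")
         .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
         # Starting number of shuffle partitions before coalescing; when
         # unset, AQE falls back to spark.sql.shuffle.partitions.
         .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "2000")
         .getOrCreate())

# With this in place, one generous global upper bound replaces per-stage
# tuning: AQE shrinks each shuffle toward the advisory partition size.
```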
3 votes, 1 answer

Does Apache Spark 3 support GPU usage for Spark RDDs?

I am currently trying to run genomic analysis pipelines using Hail (a library for genomic analyses written in Python and Scala). Recently, Apache Spark 3 was released and it supports GPU usage. I tried the spark-rapids library to start an on-premise Slurm…
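
For orientation: the RAPIDS plugin accelerates SQL/DataFrame plans, not plain RDD transformations. A hedged sketch of enabling it, assuming the rapids-4-spark jar is already on the classpath:

```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Registers the RAPIDS accelerator as a Spark plugin.
         .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
         .config("spark.rapids.sql.enabled", "true")
         .getOrCreate())

# DataFrame/SQL operators can be placed on the GPU by the plugin;
# plain RDD transformations stay on the CPU.
```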
3 votes, 1 answer

java.lang.NoSuchMethodError: com.google.flatbuffers.FlatBufferBuilder.createString(Ljava/lang/CharSequence;)I

While running pyspark3 with pandas 1.1.5 and pyarrow 2.0.0, I get the below error. Spark code: import pyarrow import pandas as pd df = pd.DataFrame({'col1': [1,2,3], 'col2': [4,5,6]}) df_sp =…
Ranga Reddy
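
The usual reading of this NoSuchMethodError is an Arrow/flatbuffers version clash between the pyarrow wheel and the jars Spark ships, so pinning pyarrow to whatever your Spark release was built against is the common workaround. A hedged sketch (the exact pin varies by Spark release and is an assumption here):

```
# Check which Arrow version your Spark jars bundle, then pin the wheel to
# match, e.g. (version is an assumption -- inspect $SPARK_HOME/jars):
#   pip install 'pyarrow==0.15.1'

import pandas as pd
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Arrow-backed conversion is what exercises the flatbuffers path.
         .config("spark.sql.execution.arrow.pyspark.enabled", "true")
         .getOrCreate())

pdf = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})
sdf = spark.createDataFrame(pdf)  # fails with mismatched Arrow artifacts
```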
3 votes, 1 answer

Spark AQE post-shuffle partition coalescing doesn't work as expected, and even makes data skew in some partitions. Why?

I use a global sort on my Spark DF, and when I enable AQE and post-shuffle coalescing, my partitions after the sort operation become even more unevenly distributed than before. "spark.sql.adaptive.enabled" -> "true", …
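
A hedged sketch of the levers that govern post-shuffle coalescing after a global sort (a global sort introduces a range-partitioned shuffle, and coalescing then merges adjacent ranges); values are placeholders:

```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.adaptive.enabled", "true")
         .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
         # How big a merged post-shuffle partition may get -- lowering this
         # keeps coalescing from gluing many ranges into one oversized task.
         .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64m")
         # Floor on the partition count after coalescing (Spark 3.0/3.1).
         .config("spark.sql.adaptive.coalescePartitions.minPartitionNum", "200")
         .getOrCreate())

df = spark.range(0, 10_000_000).orderBy("id")  # global sort -> range shuffle
```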
3 votes, 4 answers

How to create a map column to count occurrences without a UDAF

I would like to create a Map column which counts the number of occurrences. For instance:
+---+----+
|  b|   a|
+---+----+
|  1|   b|
|  2|null|
|  1|   a|
|  1|   a|
+---+----+
would result in
+---+--------------------+
|  b| …
BlueSheepToken
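
A sketch of the UDAF-free pattern: count per (b, a) pair, then fold the pairs into a map. Spark map keys cannot be null, so nulls are kept under a literal key here, which is an assumption about the desired output:

```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "b"), (2, None), (1, "a"), (1, "a")],
                           ["b", "a"])

# Map keys can't be null in Spark, so substitute a literal placeholder key.
keyed = df.withColumn("a", F.coalesce("a", F.lit("null")))

result = (keyed
          .groupBy("b", "a").count()                     # one row per (b, a)
          .groupBy("b")
          .agg(F.map_from_arrays(F.collect_list("a"),    # keys
                                 F.collect_list("count") # values
                                 ).alias("a")))
result.show(truncate=False)
```

Two plain aggregations replace the UDAF: the first produces the per-key counts, the second zips them into the map column.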
3 votes, 2 answers

PySpark - Perform Merge in Synapse using Databricks Spark

We are facing a tricky situation while performing ACID operations using Databricks Spark. We want to perform an UPSERT on an Azure Synapse table over a JDBC connection using PySpark. We are aware that Spark provides only 2 modes for writing data: APPEND…
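
Since Spark's JDBC writer only appends or overwrites, the pattern usually suggested is to land the frame in a staging table and let Synapse run the MERGE server-side (assuming your pool supports MERGE; an UPDATE+INSERT pair works otherwise). The connection strings, column names and table names below are assumptions:

```
import pyodbc  # assumption: ODBC driver available where the merge runs

# 1) Land the micro-batch / frame in a staging table via JDBC.
(df.write.format("jdbc")
   .option("url", jdbc_url)           # assumption: your Synapse JDBC URL
   .option("dbtable", "stg_mytable")
   .mode("overwrite")
   .save())

# 2) Ask Synapse to do the UPSERT server-side.
merge_sql = """
MERGE INTO dbo.mytable AS t
USING stg_mytable AS s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.val = s.val
WHEN NOT MATCHED THEN INSERT (id, val) VALUES (s.id, s.val);
"""
with pyodbc.connect(odbc_conn_str) as conn:  # assumption: ODBC conn string
    conn.execute(merge_sql)
```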
3 votes, 1 answer

Spark 3 error: java.lang.UnsatisfiedLinkError: no zstd-jni in java.library.path

After installing Spark 3 on Red Hat 7, everything seems to run.
```
os.environ['SPARK_HOME'] = "/users/spark/spark-3.0.0-bin-hadoop3.2"
os.environ['JAVA_HOME'] = "/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.222.b10-0.el7_6.x86_64/jre"
```
a simple join…
yuxu zi
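
One known cause of this error: zstd-jni unpacks its native library into java.io.tmpdir at runtime, and the load fails when /tmp is mounted noexec. A hedged sketch of the usual workarounds (the tmpdir path is an assumption):

```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Point zstd-jni's native unpacking at a directory that allows
         # execution (path is an assumption; pick any non-noexec mount).
         .config("spark.driver.extraJavaOptions",
                 "-Djava.io.tmpdir=/var/tmp/spark")
         .config("spark.executor.extraJavaOptions",
                 "-Djava.io.tmpdir=/var/tmp/spark")
         # Or sidestep zstd entirely for internal compression:
         # .config("spark.io.compression.codec", "lz4")
         .getOrCreate())
```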
2 votes, 0 answers

Unable to read data from Spanner table into Spark Job running on Dataproc cluster

I'm doing an integration wherein I'm trying to read data from a simple GCP Spanner table into a Spark job running on a Dataproc cluster. For this integration, I'm using the google-cloud-spanner-jdbc dependency in pom.xml. Though there is no…
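
A minimal JDBC read sketch, assuming the google-cloud-spanner-jdbc driver jar is on the cluster; the project, instance, database and table names are placeholders:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read.format("jdbc")
      # URL form used by the Cloud Spanner JDBC driver; the project,
      # instance and database names here are placeholders.
      .option("url", "jdbc:cloudspanner:/projects/my-project"
                     "/instances/my-instance/databases/my-db")
      .option("driver", "com.google.cloud.spanner.jdbc.JdbcDriver")
      .option("dbtable", "my_table")
      .load())
```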