Questions tagged [spark3]

To be used for Apache Spark 3.x

This tag is for all questions related to Apache Spark 3.0.0 and higher.

This tag is kept separate from the apache-spark tag because Spark 3.x introduced breaking changes.

Apache Spark is a unified analytics engine for large-scale data processing.

80 questions
30 votes, 6 answers

to_date fails to parse date in Spark 3.0

I am trying to parse a date using to_date(), but I get the following exception. SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '12/1/2010 8:26' in the new parser. You can set…
noobie-php
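
The two fixes that usually come up for this: give to_date/to_timestamp a pattern the new Java 8 time parser accepts, or restore the Spark 2.x parser behaviour. A minimal sketch:

```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("12/1/2010 8:26",)], ["ts"])

# Option 1: give the new Java 8 time parser an exact pattern.
df.select(F.to_timestamp("ts", "M/d/yyyy H:mm").alias("parsed")).show()

# Option 2: restore the Spark 2.x parser behaviour session-wide.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
```

Option 1 is the forward-looking choice; the LEGACY policy is a blanket switch that affects all datetime parsing in the session.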
9 votes, 1 answer

Spark 3.0 is much slower to read json files than Spark 2.4

I have a large number of JSON files that Spark 2.4 can read in 36 seconds, but Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, it looks like Spark 3.0 is choosing a different DAG than Spark 2.4. Does anyone have any idea what is going…
smishra
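
One commonly suggested mitigation is to pass an explicit schema so Spark 3 skips the schema-inference pass over every file; the schema and path below are assumptions for illustration:

```
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema; replace with the real one for your files.
schema = StructType([
    StructField("id", LongType()),
    StructField("payload", StringType()),
])

# Supplying the schema up front avoids the inference pass over all files.
df = spark.read.schema(schema).json("/data/events/*.json")
```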
7 votes, 1 answer

Spark 3.0.1 tasks are failing when using zstd compression codec

I'm using Spark 3.0.1 with user-provided Hadoop 3.2.0 and Scala 2.12.10, running on Kubernetes. Everything works fine when reading a parquet file compressed as snappy; however, when I try to read a parquet file compressed as zstd, several tasks fail…
phzz
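
A hedged way to narrow this down, since zstd decompression in this setup depends on native zstd support in the user-provided Hadoop: write one small file per codec and read it back; if only the zstd read fails, check `hadoop checknative -a` on the executor image. Sketch:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Write a small file with each codec to separate the codec from the data.
df = spark.range(1000)
df.write.option("compression", "snappy").parquet("/tmp/t_snappy")
df.write.option("compression", "zstd").parquet("/tmp/t_zstd")

# If this read fails while the snappy one works, check native support on
# every executor image:  hadoop checknative -a  (look for 'zstd: true').
spark.read.parquet("/tmp/t_zstd").count()
```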
5 votes, 1 answer

How to solve the following issue in Spark 3.0? Can not create the managed table. The associated location already exists.

In my Spark job, I tried to overwrite a table in each micro-batch of Structured Streaming: batchDF.write.mode(SaveMode.Overwrite).saveAsTable("mytable"). It generated the following error: Can not create the managed table('`mytable`'). The associated…
yyuankm
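
One workaround often suggested for this error is to drop the stale table and clear its old location inside foreachBatch before the overwrite. The warehouse path and table name below are assumptions, and the filesystem call goes through Spark's internal JVM gateway:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def overwrite_table(batch_df, batch_id):
    # Drop stale metadata first: the error usually means the metastore entry
    # is gone but the old files still sit under the warehouse directory.
    spark.sql("DROP TABLE IF EXISTS mytable")
    # If files linger from an earlier failed write, clear the location too
    # (the path is an assumption -- check spark.sql.warehouse.dir).
    path = spark._jvm.org.apache.hadoop.fs.Path("/user/hive/warehouse/mytable")
    fs = path.getFileSystem(spark.sparkContext._jsc.hadoopConfiguration())
    fs.delete(path, True)
    batch_df.write.mode("overwrite").saveAsTable("mytable")

# query = (df.writeStream.foreachBatch(overwrite_table)
#            .option("checkpointLocation", "/tmp/ckpt").start())
```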
4 votes, 3 answers

Convert date to ISO week date in Spark

Having dates in one column, how do I create a column containing the ISO week date? The ISO week date is composed of a year, a week number and a weekday. The year is not the same as the year obtained using the year function. The week number is the easy part - it can be…
ZygD
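
A sketch of one way to assemble the ISO week date from built-ins, using the rule that the ISO year is the calendar year of that week's Thursday:

```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2010-01-01",)], ["d"]) \
          .select(F.col("d").cast("date").alias("d"))

# ISO weekday: Monday=1 .. Sunday=7 (dayofweek is Sunday=1 .. Saturday=7).
df = df.withColumn("iso_wd", (F.dayofweek("d") + 5) % 7 + 1)

# The ISO year is the calendar year of that week's Thursday.
df = df.withColumn("iso_year",
                   F.expr("year(date_add(d, 4 - ((dayofweek(d) + 5) % 7 + 1)))"))

# 2010-01-01 is a Friday in ISO week 53 of 2009, so this prints 2009-W53-5.
df.select(F.format_string("%d-W%02d-%d", "iso_year",
                          F.weekofyear("d"), "iso_wd")
           .alias("iso_week_date")).show()
```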
4 votes, 0 answers

Spark Adaptive Query Execution not working as expected

I've tried to use Spark AQE to dynamically coalesce shuffle partitions before writing. By default, Spark creates too many small files. However, the AQE feature claims that enabling it will optimize this and merge small files into bigger…
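
A hedged config sketch for turning coalescing on; note AQE only merges partitions that come out of a shuffle, so the plan needs an exchange before the write. The numbers are placeholders:

```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.adaptive.enabled", "true")
         .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
         # Upper bound for the first shuffle; AQE coalesces down from here.
         .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "1000")
         # Target size AQE aims for when merging small partitions.
         .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")
         .getOrCreate())
```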
4 votes, 1 answer

Spark 3.0 streaming metrics in Prometheus

I'm running a Spark 3.0 application (Spark Structured Streaming) on Kubernetes and I'm trying to use the new native Prometheus metric sink. I'm able to make it work and get all the metrics described here. However, the metrics I really need are the…
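
A sketch of the native Spark 3 pieces usually combined for this, set here through the session builder; note that Structured Streaming metrics are off by default:

```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Exposes driver metrics at <driver>:4040/metrics/prometheus.
         .config("spark.ui.prometheus.enabled", "true")
         # PrometheusServlet sink (the metrics.properties equivalent).
         .config("spark.metrics.conf.*.sink.prometheusServlet.class",
                 "org.apache.spark.metrics.sink.PrometheusServlet")
         .config("spark.metrics.conf.*.sink.prometheusServlet.path",
                 "/metrics/prometheus")
         # Structured Streaming query metrics are off by default.
         .config("spark.sql.streaming.metricsEnabled", "true")
         .getOrCreate())
```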
3 votes, 1 answer

Adaptive Query Execution and Shuffle Partitions

With Adaptive Query Execution in Spark 3+, can we say that we don't need to set spark.sql.shuffle.partitions explicitly at different stages in the application? Given that we have set spark.sql.adaptive.coalescePartitions.initialPartitionNum As…
Abhishek
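
A sketch of how the knobs relate, with placeholder values: initialPartitionNum acts as the starting width before AQE coalesces, and spark.sql.shuffle.partitions remains the fallback when it is unset:

```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.adaptive.enabled", "true")
         .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
         # Starting number of shuffle partitions before coalescing; when
         # unset, AQE falls back to spark.sql.shuffle.partitions.
         .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "2000")
         .getOrCreate())

# With this in place, one generous global upper bound replaces per-stage
# tuning: AQE shrinks each shuffle toward the advisory partition size.
```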
3 votes, 1 answer

Does Apache Spark 3 support GPU usage for Spark RDDs?

I am currently trying to run genomic analysis pipelines using Hail (a library for genomic analyses written in Python and Scala). Recently, Apache Spark 3 was released and it supports GPU usage. I tried the spark-rapids library to start an on-premise Slurm…
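
For orientation: the RAPIDS plugin accelerates SQL/DataFrame plans, not plain RDD transformations. A hedged sketch of enabling it, assuming the rapids-4-spark jar is already on the classpath:

```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Registers the RAPIDS accelerator as a Spark plugin.
         .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
         .config("spark.rapids.sql.enabled", "true")
         .getOrCreate())

# DataFrame/SQL operators can be placed on the GPU by the plugin;
# plain RDD transformations stay on the CPU.
```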
3 votes, 1 answer

java.lang.NoSuchMethodError: com.google.flatbuffers.FlatBufferBuilder.createString(Ljava/lang/CharSequence;)I

While running pyspark3 with pandas 1.1.5 and pyarrow 2.0.0, I get the below error. Spark code: import pyarrow import pandas as pd df = pd.DataFrame({'col1': [1,2,3], 'col2': [4,5,6]}) df_sp =…
Ranga Reddy
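
The usual reading of this NoSuchMethodError is an Arrow/flatbuffers version clash between the pyarrow wheel and the jars Spark ships, so pinning pyarrow to whatever your Spark release was built against is the common workaround. A hedged sketch (the exact pin varies by Spark release and is an assumption here):

```
# Check which Arrow version your Spark jars bundle, then pin the wheel to
# match, e.g. (version is an assumption -- inspect $SPARK_HOME/jars):
#   pip install 'pyarrow==0.15.1'

import pandas as pd
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Arrow-backed conversion is what exercises the flatbuffers path.
         .config("spark.sql.execution.arrow.pyspark.enabled", "true")
         .getOrCreate())

pdf = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})
sdf = spark.createDataFrame(pdf)  # fails with mismatched Arrow artifacts
```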
3 votes, 1 answer

Spark AQE post-shuffle partition coalescing doesn't work as expected, and even makes data skew in some partitions. Why?

I use a global sort on my Spark DF, and when I enable AQE and post-shuffle coalescing, my partitions after the sort operation become even more unevenly distributed than before. "spark.sql.adaptive.enabled" -> "true", …
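
A hedged sketch of the levers that govern post-shuffle coalescing after a global sort (a global sort introduces a range-partitioned shuffle, and coalescing then merges adjacent ranges); values are placeholders:

```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.adaptive.enabled", "true")
         .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
         # How big a merged post-shuffle partition may get -- lowering this
         # keeps coalescing from gluing many ranges into one oversized task.
         .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64m")
         # Floor on the partition count after coalescing (Spark 3.0/3.1).
         .config("spark.sql.adaptive.coalescePartitions.minPartitionNum", "200")
         .getOrCreate())

df = spark.range(0, 10_000_000).orderBy("id")  # global sort -> range shuffle
```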
3 votes, 4 answers

How to create a map column to count occurrences without a UDAF

I would like to create a Map column which counts the number of occurrences. For instance:
+---+----+
|  b|   a|
+---+----+
|  1|   b|
|  2|null|
|  1|   a|
|  1|   a|
+---+----+
would result in
+---+--------------------+
|  b| …
BlueSheepToken
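
A sketch of the UDAF-free pattern: count per (b, a) pair, then fold the pairs into a map. Spark map keys cannot be null, so nulls are kept under a literal key here, which is an assumption about the desired output:

```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "b"), (2, None), (1, "a"), (1, "a")],
                           ["b", "a"])

# Map keys can't be null in Spark, so substitute a literal placeholder key.
keyed = df.withColumn("a", F.coalesce("a", F.lit("null")))

result = (keyed
          .groupBy("b", "a").count()                     # one row per (b, a)
          .groupBy("b")
          .agg(F.map_from_arrays(F.collect_list("a"),    # keys
                                 F.collect_list("count") # values
                                 ).alias("a")))
result.show(truncate=False)
```

Two plain aggregations replace the UDAF: the first produces the per-key counts, the second zips them into the map column.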
3 votes, 2 answers

PySpark - Perform Merge in Synapse using Databricks Spark

We are facing a tricky situation while performing ACID operations using Databricks Spark. We want to perform an UPSERT on an Azure Synapse table over a JDBC connection using PySpark. We are aware that Spark provides only 2 modes for writing data: APPEND…
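
Since Spark's JDBC writer only appends or overwrites, the pattern usually suggested is to land the frame in a staging table and let Synapse run the MERGE server-side (assuming your pool supports MERGE; an UPDATE+INSERT pair works otherwise). The connection strings, column names and table names below are assumptions:

```
import pyodbc  # assumption: ODBC driver available where the merge runs

# 1) Land the micro-batch / frame in a staging table via JDBC.
(df.write.format("jdbc")
   .option("url", jdbc_url)           # assumption: your Synapse JDBC URL
   .option("dbtable", "stg_mytable")
   .mode("overwrite")
   .save())

# 2) Ask Synapse to do the UPSERT server-side.
merge_sql = """
MERGE INTO dbo.mytable AS t
USING stg_mytable AS s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.val = s.val
WHEN NOT MATCHED THEN INSERT (id, val) VALUES (s.id, s.val);
"""
with pyodbc.connect(odbc_conn_str) as conn:  # assumption: ODBC conn string
    conn.execute(merge_sql)
```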
3 votes, 1 answer

Spark 3 error: java.lang.UnsatisfiedLinkError: no zstd-jni in java.library.path

After installing Spark 3 on Red Hat 7, everything seems to run.
```
os.environ['SPARK_HOME'] = "/users/spark/spark-3.0.0-bin-hadoop3.2"
os.environ['JAVA_HOME'] = "/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.222.b10-0.el7_6.x86_64/jre"
```
a simple join…
yuxu zi
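
One known cause of this error: zstd-jni unpacks its native library into java.io.tmpdir at runtime, and the load fails when /tmp is mounted noexec. A hedged sketch of the usual workarounds (the tmpdir path is an assumption):

```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Point zstd-jni's native unpacking at a directory that allows
         # execution (path is an assumption; pick any non-noexec mount).
         .config("spark.driver.extraJavaOptions",
                 "-Djava.io.tmpdir=/var/tmp/spark")
         .config("spark.executor.extraJavaOptions",
                 "-Djava.io.tmpdir=/var/tmp/spark")
         # Or sidestep zstd entirely for internal compression:
         # .config("spark.io.compression.codec", "lz4")
         .getOrCreate())
```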
2 votes, 0 answers

Unable to read data from Spanner table into Spark Job running on Dataproc cluster

I'm doing an integration wherein I'm trying to read data from a simple GCP Spanner table into a Spark job running on a Dataproc cluster. For this integration, I'm using the google-cloud-spanner-jdbc dependency in pom.xml. Though there is no…
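
A minimal JDBC read sketch, assuming the google-cloud-spanner-jdbc driver jar is on the cluster; the project, instance, database and table names are placeholders:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read.format("jdbc")
      # URL form used by the Cloud Spanner JDBC driver; the project,
      # instance and database names here are placeholders.
      .option("url", "jdbc:cloudspanner:/projects/my-project"
                     "/instances/my-instance/databases/my-db")
      .option("driver", "com.google.cloud.spanner.jdbc.JdbcDriver")
      .option("dbtable", "my_table")
      .load())
```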