Questions tagged [apache-spark-3.0]
27 questions
7
votes
2 answers
Does Spark support a WITH clause like SQL?
I have a table employee_1 in Spark with attributes id and name (with data), and another table employee_2 with the same attributes. I want to load the data while increasing the id values by 1.
My WITH clause is shown below:
WITH EXP AS (SELECT ALIASNAME.ID+1…

Ganesh Kumar
- 133
- 1
- 3
- 12
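Spark SQL does support common table expressions. A minimal sketch of the increment-and-load, assuming hypothetical tables named employee_1 and employee_2 (the asker's query is truncated above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cte-example").getOrCreate()

# Hypothetical stand-in for the asker's employee_1 table.
spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"]) \
    .createOrReplaceTempView("employee_1")

# WITH ... AS (a common table expression) is valid Spark SQL.
result = spark.sql("""
    WITH exp AS (
        SELECT id + 1 AS id, name FROM employee_1
    )
    SELECT * FROM exp
""")
result.write.mode("append").saveAsTable("employee_2")  # load into the second table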
6
votes
3 answers
Is Star Schema (data modelling) still relevant with the Lake House pattern using Databricks?
The more I read about the Lake House architectural pattern and follow the demos from Databricks, the less discussion I see of dimensional modelling as in a traditional data warehouse (the Kimball approach). I understand the compute and storage…

Satya Azure
- 459
- 7
- 22
2
votes
1 answer
Aggregate function with Expr in PySpark 3.0.3
The following code works well with PySpark 3.2.1:
df.withColumn(
    "total_amount",
    f.aggregate(f.col("taxes"), f.lit(0.00), lambda acc, x: acc + x["amount"]),
)
I've downgraded to PySpark 3.0.3. How do I change the above code to something like…

Smaillns
- 2,540
- 1
- 28
- 40
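pyspark.sql.functions.aggregate was only added in 3.1, so on 3.0.3 the same SQL higher-order function can be reached through expr(). A sketch under that assumption, with hypothetical data matching the question's array-of-structs column:

from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows: "taxes" is an array of structs with an "amount" field.
df = spark.createDataFrame(
    [([(1.5,), (2.5,)],)],
    "taxes: array<struct<amount: double>>",
)

# expr() exposes the SQL aggregate() higher-order function on PySpark 3.0.x.
df = df.withColumn(
    "total_amount",
    f.expr("aggregate(taxes, cast(0.0 as double), (acc, x) -> acc + x.amount)"),
)
df.show(truncate=False)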
2
votes
0 answers
ImportError: Pandas >= 0.23.2 must be installed; however, it was not found. / pyspark.pandas is not properly imported in Apache Spark 3.2.1
I have an Apache Spark 3.2.1 Docker container running with the code below. Version 3.2.1 includes the pandas API, so I changed the import line to "from pyspark import pandas as ps", but I am still getting the error
…

suj
- 507
- 1
- 8
- 22
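The message usually means the plain pandas package itself is absent from the container: the pandas-on-Spark API is a wrapper around it, not a bundled copy. A sketch assuming that is the cause here:

# pyspark.pandas (Spark 3.2+) needs the real pandas package at runtime;
# install it inside the container first, e.g.  pip install "pandas>=0.23.2"
import pyspark.pandas as ps  # the canonical import path in Spark 3.2+

psdf = ps.DataFrame({"x": [1, 2, 3]})
print(psdf["x"].sum())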
2
votes
3 answers
How to get week of month in Spark 3.0+?
I cannot find any datetime formatting pattern to get the week of month in Spark 3.0+.
Since use of 'W' is deprecated, is there a way to get the week of month without using the legacy option?
The code below doesn't work in Spark 3.2.1:
df =…

Kavishka Gamage
- 102
- 2
- 10
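One workaround is plain day-of-month arithmetic: days 1-7 become week 1, days 8-14 week 2, and so on. Note this is a simple blocked definition, not the weekday-aligned semantics of the legacy 'W' pattern:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2022-01-21",)], ["dt"]).select(F.to_date("dt").alias("dt"))

# Blocked week-of-month: days 1-7 -> 1, days 8-14 -> 2, ...
df = df.withColumn("week_of_month", F.ceil(F.dayofmonth("dt") / 7))
df.show()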
1
vote
0 answers
How to provide Hive metastore information via spark-submit?
Using Spark 3.1, I need to provide the Hive configuration via the spark-submit command (not inside the code).
Inside the code (which is not the solution I need), I can do the following, which works fine (able to list databases and select from tables)…

Itération 122442
- 2,644
- 2
- 27
- 73
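One common route (an assumption here, since the asker's working code is truncated) is to pass the metastore settings as --conf entries, relying on the documented rule that spark.hadoop.* keys are forwarded to the Hadoop/Hive configuration. The sketch below shows the in-code equivalent, with the submit-time flags and the metastore URI as placeholders in the comment:

# spark-submit equivalent (the metastore URI is a placeholder):
#   spark-submit \
#     --conf spark.sql.catalogImplementation=hive \
#     --conf spark.hadoop.hive.metastore.uris=thrift://metastore-host:9083 \
#     app.py
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-metastore-example")
    .config("spark.hadoop.hive.metastore.uris", "thrift://metastore-host:9083")
    .enableHiveSupport()  # sets spark.sql.catalogImplementation=hive
    .getOrCreate()
)
spark.sql("SHOW DATABASES").show()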
1
vote
0 answers
How to suppress INFO Spark logs?
I am experimenting with Apache Spark 3 in IntelliJ by creating a simple standalone Scala application. When I run my program I get lots of INFO logs. Based on various SO answers I tried all of the…

Mandroid
- 6,200
- 12
- 64
- 134
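The quickest lever is raising the level on the SparkContext; the identical setLogLevel call exists in the Scala API. It only takes effect after the session exists, so startup INFO lines still appear unless a log4j properties file is also put on the driver classpath. A minimal Python sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quiet-app").getOrCreate()

# Silences INFO from this point on; lines logged during startup are
# unaffected, which is why many answers also ship a log4j properties file.
spark.sparkContext.setLogLevel("WARN")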
1
vote
0 answers
Issues defining an Aggregator with case class input
I'm trying to define a custom aggregation function which takes a StructType field as an input, using the Aggregator API with DataFrames. The Spark version is 3.1.2.
Here's a reduced example (basic one-field case class, being passed in as a Row and…

Matthew Lavengood
- 11
- 1
- 3
1
vote
1 answer
How to set the driver Python path in cluster mode (PySpark)
My program runs fine in client mode, but when I try to run in cluster mode it fails; the reason is that the Python version on the cluster nodes is different.
I am trying to set the Python driver path for when my application runs in cluster mode.
below…

Akhil
- 391
- 3
- 20
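The relevant settings are spark.pyspark.python and spark.pyspark.driver.python. Since the interpreter must be chosen before the driver process starts, passing them on spark-submit is the reliable form in cluster mode; the paths below are placeholders:

# spark-submit --deploy-mode cluster \
#   --conf spark.pyspark.python=/usr/bin/python3 \
#   --conf spark.pyspark.driver.python=/usr/bin/python3 \
#   app.py
from pyspark.sql import SparkSession

# In-code form shown for reference only; in cluster mode prefer the flags
# above, because the driver interpreter is picked before this code runs.
spark = (
    SparkSession.builder
    .config("spark.pyspark.python", "/usr/bin/python3")
    .config("spark.pyspark.driver.python", "/usr/bin/python3")
    .getOrCreate()
)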
1
vote
0 answers
How to force Spark to move records with non-null fields to _corrupt_record?
Consider the code:
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}
object JsonAwsSchemaExample extends App{
val…

Cherry
- 31,309
- 66
- 224
- 364
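For context, the standard mechanism is PERMISSIVE mode with a _corrupt_record column declared in the schema; a Python sketch follows (the question's Scala is truncated, and the S3 path is a placeholder). Note that PERMISSIVE only flags rows that fail to parse; a row that parses to all-null fields is not marked corrupt, which is the nuance the question is after:

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Declare the corrupt-record column explicitly so unparseable rows land
# there instead of being dropped.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("_corrupt_record", StringType(), True),
])

df = (
    spark.read
    .schema(schema)
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .json("s3a://bucket/path/")  # placeholder path
)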
1
vote
1 answer
How to round timestamp to 10 minutes in Spark 3.0?
I have a timestamp like this in $"my_col":
2022-01-21 22:11:11
With date_trunc("minute", $"my_col") I get
2022-01-21 22:11:00
and with date_trunc("hour", $"my_col") I get
2022-01-21 22:00:00
What is the Spark 3.0 way to get
2022-01-21 22:10:00
?

Eljah
- 4,188
- 4
- 41
- 85
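date_trunc only truncates to whole units, so a common workaround is epoch-second arithmetic: floor to a 600-second boundary and cast back. A sketch:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = (
    spark.createDataFrame([("2022-01-21 22:11:11",)], ["my_col"])
    .select(F.to_timestamp("my_col").alias("my_col"))
)

# Floor the epoch seconds to a 600-second (10-minute) boundary.
df = df.withColumn(
    "rounded",
    (F.floor(F.unix_timestamp("my_col") / 600) * 600).cast("timestamp"),
)
df.show(truncate=False)  # 2022-01-21 22:11:11 -> 2022-01-21 22:10:00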
0
votes
0 answers
Spark 3.3.1 automatically fills in the current date in a DataFrame if the date is missing from a given timestamp, instead of marking it as _corrupt_record
I am using Spark 3.3.1 to read an input CSV file having the header and value below:
ID, CREATE_DATE
1, 14:42:23.0
I'm passing only the time (HH:MM:SS.SSS) whereas the date (YYYY-MM-DD) is missing in the CREATE_DATE field, and reading the CREATE_DATE field as…

mayur kandekar
- 1
- 1
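One hedged workaround: read CREATE_DATE as a plain string, then parse it with the full pattern; a value like 14:42:23.0 has no date part, so to_timestamp returns null instead of silently borrowing the current date, and those rows can be routed aside. The file path and pattern here are assumptions:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Without inferSchema, CSV columns arrive as strings, so nothing is coerced yet.
df = spark.read.option("header", True).csv("input.csv")  # placeholder path

# Strict parse: a time-only value fails the full pattern and yields null.
df = df.withColumn("CREATE_TS", F.to_timestamp("CREATE_DATE", "yyyy-MM-dd HH:mm:ss.S"))
bad_rows = df.filter(F.col("CREATE_TS").isNull() & F.col("CREATE_DATE").isNotNull())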
0
votes
0 answers
Spark Scala app getting NullPointerException while migrating in Databricks from DBR 7.3 LTS (Spark 3.0.1) to 9.1 LTS (Spark 3.1.2)
We are migrating our Spark Scala jobs from AWS EMR (6.2.1, Spark 3.0.1) to the Lakehouse, and a few of our jobs are failing due to a NullPointerException. When we tried lowering the Databricks Runtime environment to 7.3 LTS, it worked fine…

PPPP
- 561
- 1
- 4
- 14
0
votes
1 answer
Unable to set "spark.driver.maxResultSize" in Spark 3.0
I am trying to convert a Spark DataFrame into a pandas DataFrame. I have a sufficiently large driver. I am trying to set the spark.driver.maxResultSize value, like this:
spark = (
    SparkSession
    .builder
    .appName('test')
    …

Ayan Biswas
- 1,641
- 9
- 39
- 66
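A frequent cause (an assumption about this case) is that a SparkSession already exists when the builder runs, so getOrCreate() returns the old session and the new conf is silently ignored. A sketch that stops any live session first; the 8g value is illustrative:

from pyspark.sql import SparkSession

# builder.config() cannot change a session that already exists (common in
# notebooks); stop it so the conf is applied to a fresh one.
active = SparkSession.getActiveSession()
if active is not None:
    active.stop()

spark = (
    SparkSession.builder
    .appName("test")
    .config("spark.driver.maxResultSize", "8g")  # illustrative value
    .getOrCreate()
)
print(spark.conf.get("spark.driver.maxResultSize"))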
0
votes
0 answers
Migrating from Spark 2.4 to Spark 3: How to convert a class that extends SharedSQLContext to use object SparkSession?
In Spark 2.4, there exists class SharedSQLContext and related APIs have been removed in Spark 3. The equivalent of SharedSQLContext from Spark 2.4 is the SparkSession object in Spark 3.
I'm relatively new to Scala/Java, how do I approach converting…

sojim2
- 1,245
- 2
- 15
- 38