Questions tagged [apache-spark-3.0]

27 questions
7
votes
2 answers

Does Spark support a WITH clause like SQL?

I have a table employee_1 in Spark with attributes id and name (with data), and another table employee_2 with the same attributes. I want to load the data while increasing the id values by +1. My WITH clause is shown below: WITH EXP AS (SELECT ALIASNAME.ID+1…
Ganesh Kumar
  • 133
  • 1
  • 3
  • 12
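Spark SQL does support common table expressions, so a query along these lines should work via spark.sql(...). A minimal sketch, assuming the table and column names from the question (employee_1, employee_2, id, name) — not the asker's exact query:

```sql
-- Hypothetical sketch: select from employee_1 through a CTE, shifting id by +1.
WITH exp AS (
  SELECT id + 1 AS id, name
  FROM employee_1
)
SELECT id, name FROM exp
```

The resulting DataFrame could then be written to the second table, e.g. with `spark.sql(query).write.insertInto("employee_2")`.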
6
votes
3 answers

Is Star Schema (data modelling) still relevant with the Lake House pattern using Databricks?

The more I read about the Lake House architectural pattern and following the demos from Databricks I hardly see any discussion around Dimensional Modelling like in a traditional data warehouse (Kimball approach). I understand the compute and storage…
2
votes
1 answer

Aggregate function with Expr in PySpark 3.0.3

The following code works well with PySpark 3.2.1: df.withColumn( "total_amount", f.aggregate(f.col("taxes"), f.lit(0.00), lambda acc, x: acc + x["amount"]), ) I've downgraded to PySpark 3.0.3. How do I change the above code to something like…
Smaillns
  • 2,540
  • 1
  • 28
  • 40
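The Python wrapper pyspark.sql.functions.aggregate only arrived in 3.1, but the underlying SQL higher-order function has been available since Spark 2.4, so on 3.0.x the same fold can usually be expressed through f.expr. A hedged sketch of the SQL expression, assuming the taxes/amount names from the question:

```sql
-- Same fold as the 3.2.1 code, written as a SQL higher-order function:
aggregate(taxes, CAST(0.00 AS DOUBLE), (acc, x) -> acc + x.amount)
```

Wrapped in PySpark this would look like `df.withColumn("total_amount", f.expr("aggregate(taxes, CAST(0.00 AS DOUBLE), (acc, x) -> acc + x.amount)"))`.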
2
votes
0 answers

ImportError: Pandas >= 0.23.2 must be installed; however, it was not found. / pyspark/pandas are not properly imported in Apache Spark 3.2.1

I have an Apache Spark 3.2.1 Docker container running and the code below. The 3.2.1 version includes pandas, so I changed the import line to "from pyspark import pandas as ps", but I am still getting the error …
suj
  • 507
  • 1
  • 8
  • 22
2
votes
3 answers

How to get week of month in Spark 3.0+?

I cannot find any datetime formatting pattern to get the week of month in Spark 3.0+. As use of 'W' is deprecated, is there a solution to get the week of month without using the legacy option? The below code doesn't work for Spark 3.2.1: df =…
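A common workaround is to derive the week of month arithmetically from the day of month, e.g. `ceil(dayofmonth(col)/7)` inside `F.expr(...)` in Spark. Note this counts fixed 7-day blocks starting on the 1st, which can differ from the calendar-aligned semantics of the old 'W' pattern. The same arithmetic in plain Python:

```python
import math
from datetime import date

def week_of_month(d: date) -> int:
    """Week of month as fixed 7-day blocks: days 1-7 -> 1, 8-14 -> 2, ..."""
    return math.ceil(d.day / 7)

print(week_of_month(date(2022, 3, 1)))   # day 1  -> week 1
print(week_of_month(date(2022, 3, 15)))  # day 15 -> week 3
```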
1
vote
0 answers

How to provide hive metastore information via spark-submit?

Using Spark 3.1, I need to provide the Hive configuration via the spark-submit command (not inside the code). Inside the code (which is not the solution I need), I can do the following, which works fine (able to list databases and select from tables)…
Itération 122442
  • 2,644
  • 2
  • 27
  • 73
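Hadoop/Hive properties that would normally live in hive-site.xml can usually be passed on the command line by prefixing them with spark.hadoop. A hedged sketch (the metastore host, warehouse path, and script name are placeholders, not values from the question):

```shell
spark-submit \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.hadoop.hive.metastore.uris=thrift://METASTORE_HOST:9083 \
  --conf spark.sql.warehouse.dir=/path/to/warehouse \
  my_job.py
```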
1
vote
0 answers

How to suppress INFO Spark logs?

I am experimenting with Apache Spark 3 in IntelliJ by creating a simple standalone Scala application. When I run my program I get lots of INFO logs. Based on various SO answers I tried all of the…
Mandroid
  • 6,200
  • 12
  • 64
  • 134
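For a standalone application (as opposed to a spark-submit deployment), Spark up to 3.2 ships log4j 1.x, so a log4j.properties on the classpath usually takes effect. A sketch, assuming an sbt/Maven-style resources directory:

```properties
# src/main/resources/log4j.properties (Spark <= 3.2 uses log4j 1.x)
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```

A programmatic alternative is `spark.sparkContext.setLogLevel("WARN")`, though that only takes effect after the context starts, so startup INFO lines still appear.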
1
vote
0 answers

Issues defining an Aggregator with case class input

I'm trying to define a custom aggregation function which takes a StructType field as an input, using the Aggregator API with Dataframes. Spark version is 3.1.2. Here's a reduced example (basic one-field case class, being passed in as a Row and…
1
vote
1 answer

How to set driver python path in cluster mode (pyspark)

My program runs fine in client mode, but when I try to run it in cluster mode it fails; the reason is that the Python version on the cluster nodes is different. I am trying to set the Python driver path when my application runs in cluster mode; below…
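In cluster mode the driver runs on a cluster node, so the driver's interpreter path has to exist there, not on the submitting machine. The usual knobs are the spark.pyspark.python and spark.pyspark.driver.python configs (or the PYSPARK_PYTHON / PYSPARK_DRIVER_PYTHON environment variables). A hedged sketch with placeholder paths:

```shell
spark-submit \
  --deploy-mode cluster \
  --conf spark.pyspark.python=/usr/bin/python3 \
  --conf spark.pyspark.driver.python=/usr/bin/python3 \
  app.py
```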
1
vote
0 answers

How to force Spark to move records with non-null fields into _corrupt_record?

Consider the code: import com.amazonaws.auth.DefaultAWSCredentialsProviderChain import org.apache.spark.sql.SparkSession import org.apache.spark.sql.types.{StringType, StructField, StructType} object JsonAwsSchemaExample extends App{ val…
Cherry
  • 31,309
  • 66
  • 224
  • 364
1
vote
1 answer

How to round timestamp to 10 minutes in Spark 3.0?

I have a timestamp like this in $"my_col": 2022-01-21 22:11:11. With date_trunc("minute", $"my_col") I get 2022-01-21 22:11:00, and with date_trunc("hour", $"my_col") I get 2022-01-21 22:00:00. What is a Spark 3.0 way to get 2022-01-21 22:10:00?
Eljah
  • 4,188
  • 4
  • 41
  • 85
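One approach that avoids any version-specific function is plain epoch arithmetic: floor the Unix timestamp to a 600-second boundary and cast back, e.g. something like `(floor(unix_timestamp($"my_col") / 600) * 600).cast("timestamp")` in Spark (a sketch, not tested here). The underlying arithmetic in plain Python:

```python
from datetime import datetime, timezone

def floor_to_10_minutes(ts: datetime) -> datetime:
    """Floor a timestamp to the previous 10-minute boundary via epoch arithmetic."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % 600, tz=timezone.utc)

ts = datetime(2022, 1, 21, 22, 11, 11, tzinfo=timezone.utc)
print(floor_to_10_minutes(ts))  # 2022-01-21 22:10:00+00:00
```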
0
votes
0 answers

Spark 3.3.1 automatically picks up the current date in a data frame if the date is missing from the given timestamp, and does not mark it as _corrupt_record

I am using Spark 3.3.1 to read an input CSV file with the header and value below: ID, CREATE_DATE 1, 14:42:23.0 I'm passing only the time (HH:MM:SS.SSS) whereas the date (YYYY-MM-DD) is missing in the CREATE_DATE field, and reading the CREATE_DATE field as…
0
votes
0 answers

Spark Scala app getting NullPointerException while migrating in databricks from DBR 7.3 LTS(spark 3.0.1) to 9.1 LTS(spark 3.1.2)

We are migrating our Spark Scala jobs from AWS EMR (6.2.1, Spark version 3.0.1) to Lakehouse, and a few of our jobs are failing due to NullPointerException. When we lower the Databricks Runtime environment to 7.3 LTS, it works fine…
PPPP
  • 561
  • 1
  • 4
  • 14
0
votes
1 answer

Unable to set "spark.driver.maxResultSize" in Spark 3.0

I am trying to convert a Spark dataframe into a pandas dataframe. I have a sufficiently large driver. I am trying to set the spark.driver.maxResultSize value like this: spark = ( SparkSession .builder .appName('test') …
Ayan Biswas
  • 1,641
  • 9
  • 39
  • 66
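A likely cause here: builder.getOrCreate() returns any already-running session and silently ignores new configs, and spark.driver.maxResultSize must be in place before the driver JVM starts. Setting it outside the code, at submit time, usually sidesteps both issues. A sketch (the 4g value and script name are placeholders):

```shell
spark-submit --conf spark.driver.maxResultSize=4g app.py
```

When setting it in code instead, stopping any existing session first (spark.stop()) before building a new one with the config is typically required.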
0
votes
0 answers

Migrating from Spark 2.4 to Spark 3: How to convert a class that extends SharedSQLContext to use object SparkSession?

In Spark 2.4 there exists the class SharedSQLContext, and the related APIs have been removed in Spark 3. The equivalent of SharedSQLContext from Spark 2.4 is the SparkSession object in Spark 3. I'm relatively new to Scala/Java; how do I approach converting…
sojim2
  • 1,245
  • 2
  • 15
  • 38