Questions tagged [apache-spark-sql-repartition]

23 questions
3 votes, 1 answer

What is the difference between spark.shuffle.partition and spark.repartition in Spark?

What I understand is: when we repartition any dataframe with value n, the data will remain in those n partitions until you hit a shuffle stage or another repartition or coalesce. As for the shuffle setting, it only comes into play when you…
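A minimal sketch of the distinction this question is after, assuming a local PySpark session (the numbers are illustrative): repartition(n) reshuffles an existing DataFrame into n partitions, while spark.sql.shuffle.partitions only controls the partition count of shuffles Spark introduces itself (joins, aggregations).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").getOrCreate()

# Disable AQE so the shuffle partition count is not coalesced automatically.
spark.conf.set("spark.sql.adaptive.enabled", "false")
# Shuffles that Spark introduces itself (joins, groupBy) use this setting.
spark.conf.set("spark.sql.shuffle.partitions", "8")

df = spark.range(1_000_000)

# Explicit repartition: the data sits in exactly 5 partitions until the
# next shuffle, repartition, or coalesce.
df5 = df.repartition(5)
print(df5.rdd.getNumPartitions())   # 5

# An aggregation triggers a shuffle, so its output picks up
# spark.sql.shuffle.partitions instead.
agg = df5.groupBy((df5.id % 10).alias("k")).count()
print(agg.rdd.getNumPartitions())   # 8
```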
3 votes, 3 answers

Can coalesce increase the number of partitions of a Spark DataFrame?

I am trying to understand the difference between coalesce() and repartition(). If I correctly understood this answer, coalesce() can only reduce the number of partitions of a dataframe, and if we try to increase the number of partitions then the number of…
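A quick way to see this behaviour for yourself, assuming a local session (the counts are arbitrary): coalesce() avoids a shuffle and can only merge partitions, whereas repartition() always shuffles and can go in either direction.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").getOrCreate()

df = spark.range(100).repartition(4)
print(df.rdd.getNumPartitions())                  # 4

# coalesce() avoids a shuffle, so it can only merge partitions;
# asking for more than you have leaves the count unchanged.
print(df.coalesce(10).rdd.getNumPartitions())     # still 4
print(df.coalesce(2).rdd.getNumPartitions())      # 2

# repartition() always shuffles and can increase or decrease the count.
print(df.repartition(10).rdd.getNumPartitions())  # 10
```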
2 votes, 1 answer

Apache Spark - passing a JDBC connection object to executors

I am creating a JDBC connection object in the Spark driver and using it in the executors to access the database. My concern is whether it is the same connection object, or whether each executor gets a copy of the connection object so that there would be a separate connection per…
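The usual pattern in this situation is to open the connection on the executor side rather than ship a driver-side object, typically one connection per partition via foreachPartition. A hedged sketch follows; sqlite3 is used only as a stand-in so it runs locally, and the table and columns are made up — with a real JDBC driver the structure is the same.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "value"])

def write_partition(rows):
    # One connection per partition, created on the executor that runs the task.
    import sqlite3
    conn = sqlite3.connect("/tmp/demo.db")
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER, value TEXT)")
        conn.executemany("INSERT INTO events VALUES (?, ?)",
                         [(r["id"], r["value"]) for r in rows])
        conn.commit()
    finally:
        conn.close()

df.foreachPartition(write_partition)
```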
2 votes, 2 answers

Apache Spark: What happens when repartition($"key") is called and the size of all records per key is greater than the size of a single partition?

Suppose I have a 10 GB dataframe where one of the columns, "c1", has the same value for every record. Each partition is at most 128 MB (the default value). Suppose I call repartition($"c1"): will all the records be shuffled to the same…
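A small sketch of what hash partitioning does in this case, assuming a local session and a toy dataset: with a single distinct value in c1, every row maps to one partition, and the 128 MB figure does not cap it (that setting governs file splits at read time, not shuffle output sizes).

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import spark_partition_id

spark = SparkSession.builder.master("local[4]").getOrCreate()

# Every row has the same value in c1, so hash partitioning by c1
# maps all rows to a single partition.
df = spark.range(100_000).withColumn("c1", F.lit("x"))

by_key = df.repartition(200, "c1")
print(by_key.rdd.getNumPartitions())   # 200 partitions exist...
by_key.groupBy(spark_partition_id().alias("pid")).count().show()
# ...but only one of them holds any rows.
```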
1 vote, 0 answers

Hanging Task in Databricks

I am applying a pandas UDF to a grouped dataframe in Databricks. When I do this, a couple of tasks hang forever, while the rest complete quickly. I start by repartitioning my dataset so that each group is in one partition: group_factors = ['a','b','c']…
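For reference, a sketch of the kind of setup described (grouped pandas UDF applied after repartitioning by the group columns). The data, schema, and aggregation are invented; only the group columns 'a', 'b', 'c' come from the excerpt. A heavily skewed group makes its one task run far longer than the rest.

```python
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.master("local[4]").getOrCreate()

df = spark.createDataFrame(
    [("x", 1, 1, 10.0), ("x", 1, 2, 20.0), ("y", 2, 1, 5.0)],
    ["a", "b", "c", "value"],
)

group_factors = ["a", "b", "c"]

def summarize(pdf: pd.DataFrame) -> pd.DataFrame:
    # Runs once per group; a skewed group turns into one long-running task.
    return pdf.groupby(group_factors, as_index=False).agg(total=("value", "sum"))

result = (
    df.repartition(*group_factors)   # shuffle keyed by the group columns
      .groupBy(group_factors)
      .applyInPandas(summarize, schema="a string, b long, c long, total double")
)
result.show()
```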
1 vote, 1 answer

Understanding spark.default.parallelism

As per the documentation: spark.default.parallelism: Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user. spark.default.parallelism: For distributed shuffle operations like…
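A small sketch of where that setting actually shows up, assuming a local session: spark.default.parallelism applies to RDD operations such as parallelize and reduceByKey, while DataFrame shuffles are governed by spark.sql.shuffle.partitions instead.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[4]")
         .config("spark.default.parallelism", "6")      # RDD-level default
         .config("spark.sql.shuffle.partitions", "12")  # DataFrame shuffles
         .getOrCreate())
sc = spark.sparkContext

# parallelize() with no explicit numSlices falls back to spark.default.parallelism.
rdd = sc.parallelize(range(100))
print(rdd.getNumPartitions())       # 6

# reduceByKey() without a numPartitions argument also uses it.
print(rdd.map(lambda x: (x % 3, 1))
         .reduceByKey(lambda a, b: a + b)
         .getNumPartitions())       # 6
```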
1 vote, 0 answers

How can I reduce the number of Spark tasks when I run a Spark job?

Here are my Spark job stages: there are 260,000 tasks because the job relies on more than 200,000 small HDFS files, each about 50 MB and stored in gzip format. I tried using the following settings to reduce the tasks, but it didn't work. ... --conf…
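A hedged sketch of the knobs that usually matter here (the path and the coalesce target are placeholders): gzip files are not splittable, so each file must be read whole, but Spark's file source can still pack several small files into one read partition, and a coalesce after the read cuts the downstream task count.

```python
from pyspark.sql import SparkSession

# Pack many small input files into fewer read partitions.
spark = (SparkSession.builder
         .config("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))
         .config("spark.sql.files.openCostInBytes", str(4 * 1024 * 1024))
         .getOrCreate())

df = spark.read.text("hdfs:///path/to/small-gz-files/*.gz")  # placeholder path
print(df.rdd.getNumPartitions())

# gzip is not splittable, so each file is still read in one piece;
# a coalesce after the read at least reduces the number of downstream tasks.
df_fewer = df.coalesce(2000)   # target count is illustrative
```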
1 vote, 1 answer

How does pyspark repartition work without a column name specified?

There are two dataframes, df and df1. Let's consider 3 cases: (1) df1 only has the same number of rows as df; (2) df1 has the same number of rows as df and the same number of partitions as df. Think df.repartition(k) and df1.repartition(k) were…
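A quick sketch of the underlying behaviour, assuming a local session: when repartition is given only a number, rows are redistributed round-robin rather than by any column hash, so the k partitions end up roughly even regardless of the data.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id

spark = SparkSession.builder.master("local[4]").getOrCreate()

df = spark.range(1000)
k = 5

# No column given: round-robin partitioning spreads rows evenly
# across k partitions regardless of their values.
df_k = df.repartition(k)
df_k.groupBy(spark_partition_id().alias("pid")).count().orderBy("pid").show()

# With a column, rows with the same value hash to the same partition instead.
df_by_col = df.repartition(k, (df.id % 3).alias("bucket"))
df_by_col.groupBy(spark_partition_id().alias("pid")).count().orderBy("pid").show()
```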
0 votes, 0 answers

Use Spark coalesce without decreasing the parallelism of earlier operations

Let’s say you had a parallelism of 1000, but you only wanted to write 10 files at the end: load().map(…).filter(…).coalesce(10).save(). However, Spark will effectively push the coalesce operation down to as early a point as possible, so this will…
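The workaround the Spark documentation itself suggests for coalesce is to call repartition instead, which inserts a shuffle boundary so the upstream stage keeps its full parallelism. A sketch with placeholder paths and assumed column names:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/data/in")                    # placeholder input

# coalesce(10) can collapse the whole upstream stage to 10 tasks;
# repartition(10) adds a shuffle, so map/filter still run at full
# parallelism and only the write uses 10 tasks.
result = (df.withColumn("flag", F.col("value") > 0)    # 'value' is assumed
            .filter("flag")
            .repartition(10))

result.write.mode("overwrite").parquet("/data/out")    # placeholder output
```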
0 votes, 2 answers

repartition in memory vs file

repartition() creates partitions in memory and is used as a read() operation. partitionBy() creates partitions on disk and is used as a write operation. How can we confirm there are multiple files in memory while using repartition()? If repartition…
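A sketch showing how to confirm both sides of the contrast, assuming a local session (the output path is illustrative): getNumPartitions reveals the in-memory partitioning after repartition(), and the country=... subdirectories show what partitionBy() does at write time.

```python
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").getOrCreate()

df = spark.createDataFrame(
    [(1, "us"), (2, "us"), (3, "de"), (4, "de")], ["id", "country"]
)

# In-memory partitioning: repartition() changes how the data is split
# across tasks, which you can confirm without writing anything.
print(df.repartition(3).rdd.getNumPartitions())   # 3

# On-disk partitioning: partitionBy() is a property of the write and
# shows up as subdirectories per value of the partition column.
out = "/tmp/partitionby_demo"
df.write.mode("overwrite").partitionBy("country").parquet(out)
print(sorted(d for d in os.listdir(out) if d.startswith("country=")))
# ['country=de', 'country=us']
```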
0 votes, 1 answer

If I repartition by a column, does Spark understand that the data is partitioned by that column when it is read back?

I have a requirement where I have a huge dataset of over 2 trillion records, which comes as the result of a join. After this join, I need to aggregate on a column (the 'id' column) and get a list of distinct names (collect_set('name')). Now, while…
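One pattern that is sometimes suggested for this situation, as a sketch only (the input path, table name, and bucket count are placeholders): a plain repartition('id') before writing is not remembered at read time, but bucketing the saved table by 'id' records the layout in the catalog, so a later groupBy('id') can avoid a full shuffle.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/data/joined")      # placeholder input

(df.write.mode("overwrite")
   .bucketBy(512, "id")                      # bucket count is illustrative
   .sortBy("id")
   .saveAsTable("joined_bucketed"))

names_per_id = (spark.table("joined_bucketed")
                     .groupBy("id")
                     .agg(F.collect_set("name").alias("names")))
```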
0 votes, 1 answer

How to export SQL files in Synapse to sandbox environment or directly access these SQL files via notebooks?

Is it possible to export published SQL files in your Synapse workspace to your sandbox environment via code, without the use of pipelines? If not, is it somehow possible to access your published SQL files via a notebook with e.g. pySpark, Scala,…
0 votes, 0 answers

PySpark performance slow when reading a large fixed-width file with long lines to convert to a structured format

I am trying to convert a fairly large (34 GB) fixed-width file into a structured format using pySpark, but my job is taking too long to complete (almost 10+ hours). The file has long lines of almost 50K characters, which I am trying to split using substring into…
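For context, a hedged sketch of the substring-based split being described; the field layout, column names, and paths are invented, and a real 50K-character record would simply have many more entries in the layout list.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder layout: (column name, 1-based start position, width).
layout = [("account_id", 1, 10), ("name", 11, 30), ("balance", 41, 12)]

raw = spark.read.text("/data/fixed_width/input.txt")   # placeholder path

parsed = raw.select(
    *[F.substring("value", start, width).alias(name)
      for name, start, width in layout]
)

# Writing to a columnar format once avoids re-parsing the wide lines later.
parsed.write.mode("overwrite").parquet("/data/fixed_width/output")
```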
0 votes, 0 answers

How to force a group of data to be processed by a single executor in Spark

Let's say I have data such as:
+-------+------+-----+---------------+--------+
|Account|nature|value|           time|repeated|
+-------+------+-----+---------------+--------+
|      a|     1|   50|10:05:37:293084|   false|
|      a|     1| …
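A sketch of the usual approach, using the columns from the excerpt with invented rows: repartitioning by the grouping column puts every row for an account in the same partition, and each partition is processed by a single task (and therefore a single executor at a time).

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id

spark = SparkSession.builder.master("local[4]").getOrCreate()

df = spark.createDataFrame(
    [("a", 1, 50, "10:05:37:293084", False),
     ("a", 1, 60, "10:05:38:293084", False),
     ("b", 2, 30, "10:05:39:293084", True)],
    ["Account", "nature", "value", "time", "repeated"],
)

# Hash-partitioning by Account places every row of an account in the same
# partition; each partition is then handled by exactly one task.
by_account = df.repartition("Account")
by_account.select("Account", spark_partition_id().alias("pid")).distinct().show()
```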
0 votes, 1 answer

Spark number of input partitions vs number of reading tasks

Can someone explain to me how Spark determines the number of tasks when reading data? How is it related to the number of partitions of the input file and the number of cores? I have a dataset (91 MB) that is divided into 14 partitions (~6.5 MB…
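A small sketch of how to inspect this, with a placeholder input path: for DataFrame file sources the read task count is driven mainly by spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes (and whether the files are splittable), rather than directly by the core count.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[4]")
         .config("spark.sql.files.maxPartitionBytes", str(32 * 1024 * 1024))
         .getOrCreate())

# Roughly total_bytes / maxPartitionBytes read partitions, adjusted by an
# openCostInBytes charge per file and capped by how the files can be split.
df = spark.read.parquet("/data/91mb_dataset")   # placeholder input
print(df.rdd.getNumPartitions())   # ~3 for a 91 MB dataset with a 32 MB target
```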