Questions tagged [apache-spark-sql-repartition]

23 questions
3 votes, 1 answer

What is the difference between spark.shuffle.partition and spark.repartition in Spark?

What I understand is: when we repartition any dataframe with value n, the data will remain in those n partitions until you hit a shuffle stage or another repartition or coalesce. As for the shuffle setting, it only comes into play when you…
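A minimal sketch of the distinction this question is after, assuming a local PySpark session (the numbers are illustrative): repartition(n) reshuffles an existing DataFrame into n partitions, while spark.sql.shuffle.partitions only controls the partition count of shuffles Spark introduces itself (joins, aggregations).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").getOrCreate()

# Disable AQE so the shuffle partition count is not coalesced automatically.
spark.conf.set("spark.sql.adaptive.enabled", "false")
# Shuffles that Spark introduces itself (joins, groupBy) use this setting.
spark.conf.set("spark.sql.shuffle.partitions", "8")

df = spark.range(1_000_000)

# Explicit repartition: the data sits in exactly 5 partitions until the
# next shuffle, repartition, or coalesce.
df5 = df.repartition(5)
print(df5.rdd.getNumPartitions())   # 5

# An aggregation triggers a shuffle, so its output picks up
# spark.sql.shuffle.partitions instead.
agg = df5.groupBy((df5.id % 10).alias("k")).count()
print(agg.rdd.getNumPartitions())   # 8
```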
3 votes, 3 answers

Can coalesce increase the number of partitions of a Spark DataFrame?

I am trying to understand the difference between coalesce() and repartition(). If I correctly understood this answer, coalesce() can only reduce the number of partitions of a dataframe, and if we try to increase the number of partitions then the number of…
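A quick way to see this behaviour for yourself, assuming a local session (the counts are arbitrary): coalesce() avoids a shuffle and can only merge partitions, whereas repartition() always shuffles and can go in either direction.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").getOrCreate()

df = spark.range(100).repartition(4)
print(df.rdd.getNumPartitions())                  # 4

# coalesce() avoids a shuffle, so it can only merge partitions;
# asking for more than you have leaves the count unchanged.
print(df.coalesce(10).rdd.getNumPartitions())     # still 4
print(df.coalesce(2).rdd.getNumPartitions())      # 2

# repartition() always shuffles and can increase or decrease the count.
print(df.repartition(10).rdd.getNumPartitions())  # 10
```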
2 votes, 1 answer

Apache Spark - passing a JDBC connection object to executors

I am creating a JDBC connection object in the Spark driver and using it in the executors to access the database. My concern is whether it is the same connection object, or whether each executor gets a copy of the connection object so that there would be a separate connection per…
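The usual pattern in this situation is to open the connection on the executor side rather than ship a driver-side object, typically one connection per partition via foreachPartition. A hedged sketch follows; sqlite3 is used only as a stand-in so it runs locally, and the table and columns are made up — with a real JDBC driver the structure is the same.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "value"])

def write_partition(rows):
    # One connection per partition, created on the executor that runs the task.
    import sqlite3
    conn = sqlite3.connect("/tmp/demo.db")
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER, value TEXT)")
        conn.executemany("INSERT INTO events VALUES (?, ?)",
                         [(r["id"], r["value"]) for r in rows])
        conn.commit()
    finally:
        conn.close()

df.foreachPartition(write_partition)
```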
2 votes, 2 answers

Apache Spark: What happens when repartition($"key") is called and the size of all records per key is greater than the size of a single partition?

Suppose I have a 10 GB dataframe where one of the columns, "c1", has the same value for every record. Each partition is at most 128 MB (the default value). Suppose I call repartition($"c1"): will all the records be shuffled to the same…
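A small sketch of what hash partitioning does in this case, assuming a local session and a toy dataset: with a single distinct value in c1, every row maps to one partition, and the 128 MB figure does not cap it (that setting governs file splits at read time, not shuffle output sizes).

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import spark_partition_id

spark = SparkSession.builder.master("local[4]").getOrCreate()

# Every row has the same value in c1, so hash partitioning by c1
# maps all rows to a single partition.
df = spark.range(100_000).withColumn("c1", F.lit("x"))

by_key = df.repartition(200, "c1")
print(by_key.rdd.getNumPartitions())   # 200 partitions exist...
by_key.groupBy(spark_partition_id().alias("pid")).count().show()
# ...but only one of them holds any rows.
```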
1 vote, 0 answers

Hanging Task in Databricks

I am applying a pandas UDF to a grouped dataframe in Databricks. When I do this, a couple of tasks hang forever, while the rest complete quickly. I start by repartitioning my dataset so that each group is in one partition: group_factors = ['a','b','c']…
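For reference, a sketch of the kind of setup described (grouped pandas UDF applied after repartitioning by the group columns). The data, schema, and aggregation are invented; only the group columns 'a', 'b', 'c' come from the excerpt. A heavily skewed group makes its one task run far longer than the rest.

```python
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.master("local[4]").getOrCreate()

df = spark.createDataFrame(
    [("x", 1, 1, 10.0), ("x", 1, 2, 20.0), ("y", 2, 1, 5.0)],
    ["a", "b", "c", "value"],
)

group_factors = ["a", "b", "c"]

def summarize(pdf: pd.DataFrame) -> pd.DataFrame:
    # Runs once per group; a skewed group turns into one long-running task.
    return pdf.groupby(group_factors, as_index=False).agg(total=("value", "sum"))

result = (
    df.repartition(*group_factors)   # shuffle keyed by the group columns
      .groupBy(group_factors)
      .applyInPandas(summarize, schema="a string, b long, c long, total double")
)
result.show()
```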
1 vote, 1 answer

Understanding spark.default.parallelism

As per the documentation: spark.default.parallelism: Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user. spark.default.parallelism: For distributed shuffle operations like…
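A small sketch of where that setting actually shows up, assuming a local session: spark.default.parallelism applies to RDD operations such as parallelize and reduceByKey, while DataFrame shuffles are governed by spark.sql.shuffle.partitions instead.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[4]")
         .config("spark.default.parallelism", "6")      # RDD-level default
         .config("spark.sql.shuffle.partitions", "12")  # DataFrame shuffles
         .getOrCreate())
sc = spark.sparkContext

# parallelize() with no explicit numSlices falls back to spark.default.parallelism.
rdd = sc.parallelize(range(100))
print(rdd.getNumPartitions())       # 6

# reduceByKey() without a numPartitions argument also uses it.
print(rdd.map(lambda x: (x % 3, 1))
         .reduceByKey(lambda a, b: a + b)
         .getNumPartitions())       # 6
```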
1 vote, 0 answers

How can I reduce the number of Spark tasks when I run a Spark job?

Here are my Spark job stages: there are 260,000 tasks because the job relies on more than 200,000 small HDFS files, each about 50 MB and stored in gzip format. I tried using the following settings to reduce the tasks, but it didn't work. ... --conf…
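A hedged sketch of the knobs that usually matter here (the path and the coalesce target are placeholders): gzip files are not splittable, so each file must be read whole, but Spark's file source can still pack several small files into one read partition, and a coalesce after the read cuts the downstream task count.

```python
from pyspark.sql import SparkSession

# Pack many small input files into fewer read partitions.
spark = (SparkSession.builder
         .config("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))
         .config("spark.sql.files.openCostInBytes", str(4 * 1024 * 1024))
         .getOrCreate())

df = spark.read.text("hdfs:///path/to/small-gz-files/*.gz")  # placeholder path
print(df.rdd.getNumPartitions())

# gzip is not splittable, so each file is still read in one piece;
# a coalesce after the read at least reduces the number of downstream tasks.
df_fewer = df.coalesce(2000)   # target count is illustrative
```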
1 vote, 1 answer

How does pyspark repartition work without a column name specified?

There are two dataframes, df and df1. Let's consider 3 cases: (1) df1 only has the same number of rows as df; (2) df1 has the same number of rows as df and the same number of partitions as df. Think df.repartition(k) and df1.repartition(k) were…
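A quick sketch of the underlying behaviour, assuming a local session: when repartition is given only a number, rows are redistributed round-robin rather than by any column hash, so the k partitions end up roughly even regardless of the data.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id

spark = SparkSession.builder.master("local[4]").getOrCreate()

df = spark.range(1000)
k = 5

# No column given: round-robin partitioning spreads rows evenly
# across k partitions regardless of their values.
df_k = df.repartition(k)
df_k.groupBy(spark_partition_id().alias("pid")).count().orderBy("pid").show()

# With a column, rows with the same value hash to the same partition instead.
df_by_col = df.repartition(k, (df.id % 3).alias("bucket"))
df_by_col.groupBy(spark_partition_id().alias("pid")).count().orderBy("pid").show()
```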
0 votes, 0 answers

Use Spark coalesce without decreasing the parallelism of earlier operations

Let’s say you had a parallelism of 1000, but you only wanted to write 10 files at the end: load().map(…).filter(…).coalesce(10).save(). However, Spark will effectively push the coalesce operation down to as early a point as possible, so this will…
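The workaround the Spark documentation itself suggests for coalesce is to call repartition instead, which inserts a shuffle boundary so the upstream stage keeps its full parallelism. A sketch with placeholder paths and assumed column names:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/data/in")                    # placeholder input

# coalesce(10) can collapse the whole upstream stage to 10 tasks;
# repartition(10) adds a shuffle, so map/filter still run at full
# parallelism and only the write uses 10 tasks.
result = (df.withColumn("flag", F.col("value") > 0)    # 'value' is assumed
            .filter("flag")
            .repartition(10))

result.write.mode("overwrite").parquet("/data/out")    # placeholder output
```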
0 votes, 2 answers

repartition in memory vs file

repartition() creates partitions in memory and is used as a read() operation. partitionBy() creates partitions on disk and is used as a write operation. How can we confirm there are multiple files in memory while using repartition()? If repartition…
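A sketch showing how to confirm both sides of the contrast, assuming a local session (the output path is illustrative): getNumPartitions reveals the in-memory partitioning after repartition(), and the country=... subdirectories show what partitionBy() does at write time.

```python
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").getOrCreate()

df = spark.createDataFrame(
    [(1, "us"), (2, "us"), (3, "de"), (4, "de")], ["id", "country"]
)

# In-memory partitioning: repartition() changes how the data is split
# across tasks, which you can confirm without writing anything.
print(df.repartition(3).rdd.getNumPartitions())   # 3

# On-disk partitioning: partitionBy() is a property of the write and
# shows up as subdirectories per value of the partition column.
out = "/tmp/partitionby_demo"
df.write.mode("overwrite").partitionBy("country").parquet(out)
print(sorted(d for d in os.listdir(out) if d.startswith("country=")))
# ['country=de', 'country=us']
```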
0 votes, 1 answer

If I repartition by a column, does Spark understand that the data is partitioned by that column when it is read back?

I have a requirement where I have a huge dataset of over 2 trillion records, which comes as the result of a join. After this join, I need to aggregate on a column (the 'id' column) and get a list of distinct names (collect_set('name')). Now, while…
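One pattern that is sometimes suggested for this situation, as a sketch only (the input path, table name, and bucket count are placeholders): a plain repartition('id') before writing is not remembered at read time, but bucketing the saved table by 'id' records the layout in the catalog, so a later groupBy('id') can avoid a full shuffle.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/data/joined")      # placeholder input

(df.write.mode("overwrite")
   .bucketBy(512, "id")                      # bucket count is illustrative
   .sortBy("id")
   .saveAsTable("joined_bucketed"))

names_per_id = (spark.table("joined_bucketed")
                     .groupBy("id")
                     .agg(F.collect_set("name").alias("names")))
```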
0 votes, 1 answer

How to export SQL files in Synapse to sandbox environment or directly access these SQL files via notebooks?

Is it possible to export published SQL files in your Synapse workspace to your sandbox environment via code, without the use of pipelines? If not, is it somehow possible to access your published SQL files via a notebook with e.g. pySpark, Scala,…
0 votes, 0 answers

PySpark performance slow when reading a large fixed-width file with long lines to convert to a structured format

I am trying to convert a fairly large (34 GB) fixed-width file into a structured format using pySpark, but my job is taking too long to complete (almost 10+ hours). The file has long lines of almost 50K characters, which I am trying to split using substring into…
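For context, a hedged sketch of the substring-based split being described; the field layout, column names, and paths are invented, and a real 50K-character record would simply have many more entries in the layout list.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder layout: (column name, 1-based start position, width).
layout = [("account_id", 1, 10), ("name", 11, 30), ("balance", 41, 12)]

raw = spark.read.text("/data/fixed_width/input.txt")   # placeholder path

parsed = raw.select(
    *[F.substring("value", start, width).alias(name)
      for name, start, width in layout]
)

# Writing to a columnar format once avoids re-parsing the wide lines later.
parsed.write.mode("overwrite").parquet("/data/fixed_width/output")
```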
0 votes, 0 answers

How to force a group of data to be processed by a single executor in Spark

Let's say I have data such as:
+-------+------+-----+---------------+--------+
|Account|nature|value|           time|repeated|
+-------+------+-----+---------------+--------+
|      a|     1|   50|10:05:37:293084|   false|
|      a|     1| …
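A sketch of the usual approach, using the columns from the excerpt with invented rows: repartitioning by the grouping column puts every row for an account in the same partition, and each partition is processed by a single task (and therefore a single executor at a time).

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id

spark = SparkSession.builder.master("local[4]").getOrCreate()

df = spark.createDataFrame(
    [("a", 1, 50, "10:05:37:293084", False),
     ("a", 1, 60, "10:05:38:293084", False),
     ("b", 2, 30, "10:05:39:293084", True)],
    ["Account", "nature", "value", "time", "repeated"],
)

# Hash-partitioning by Account places every row of an account in the same
# partition; each partition is then handled by exactly one task.
by_account = df.repartition("Account")
by_account.select("Account", spark_partition_id().alias("pid")).distinct().show()
```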
0 votes, 1 answer

Spark number of input partitions vs number of reading tasks

Can someone explain to me how Spark determines the number of tasks when reading data? How is it related to the number of partitions of the input file and the number of cores? I have a dataset (91 MB) that is divided into 14 partitions (~6.5 MB…
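A small sketch of how to inspect this, with a placeholder input path: for DataFrame file sources the read task count is driven mainly by spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes (and whether the files are splittable), rather than directly by the core count.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[4]")
         .config("spark.sql.files.maxPartitionBytes", str(32 * 1024 * 1024))
         .getOrCreate())

# Roughly total_bytes / maxPartitionBytes read partitions, adjusted by an
# openCostInBytes charge per file and capped by how the files can be split.
df = spark.read.parquet("/data/91mb_dataset")   # placeholder input
print(df.rdd.getNumPartitions())   # ~3 for a 91 MB dataset with a 32 MB target
```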