
Let's say I have data like this:

+-------+------+-----+---------------+--------+
|Account|nature|value|           time|repeated|
+-------+------+-----+---------------+--------+
|      a|     1|   50|10:05:37:293084| false  |
|      a|     1|   50|10:06:46:806510| false  |
|      a|     0|   50|11:19:42:951479| false  |
|      a|     1|   40|19:14:50:479055| false  |
|      a|     0|   50|16:56:17:251624| false  |
|      a|     1|   40|16:33:12:133861| false  |
|      a|     1|   20|17:33:01:385710| false  |
|      b|     0|   30|12:54:49:483725| false  |
|      b|     0|   40|19:23:25:845489| false  |
|      b|     1|   30|10:58:02:276576| false  |
|      b|     1|   40|12:18:27:161290| false  |
|      b|     0|   50|12:01:50:698592| false  |
|      b|     0|   50|08:45:53:894441| false  |
|      b|     0|   40|17:36:55:827330| false  |
|      b|     1|   50|17:18:41:728486| false  |
+-------+------+-----+---------------+--------+

I want each account to be processed by a single executor, e.g. account a handled entirely by one executor and account b by a different one, so that there is no parallelism within a single account.

I read about `repartition(partitionExprs: Column*)`. Will `repartition($"Account")` partition the data by account, so that for my example data it creates 2 partitions and sends them as tasks to different executors? How does repartition work, and how are the resulting partitions distributed to executors?
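A point worth checking here: `repartition` by a column does a hash shuffle, not one partition per distinct key. A minimal sketch (assuming Spark 3.x in Scala; column names taken from the table above, the hypothetical `pid` column is just for inspection):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.spark_partition_id

val spark = SparkSession.builder().appName("repartition-demo").getOrCreate()
import spark.implicits._

// A small subset of the example data.
val df = Seq(
  ("a", 1, 50, "10:05:37:293084", false),
  ("a", 0, 50, "11:19:42:951479", false),
  ("b", 0, 30, "12:54:49:483725", false),
  ("b", 1, 40, "12:18:27:161290", false)
).toDF("Account", "nature", "value", "time", "repeated")

// Hash-partitions rows by Account: all rows with the same Account land
// in the same partition, but two different accounts may share a partition.
// The partition count defaults to spark.sql.shuffle.partitions (200),
// not to the number of distinct accounts.
val byAccount = df.repartition($"Account")

// Inspect which partition each row ended up in.
byAccount.withColumn("pid", spark_partition_id()).show()
```

So the guarantee is "same account, same partition", which gives you no parallelism within an account; each partition is then scheduled as one task on some executor, but Spark does not guarantee which executor, and several partitions (hence several accounts) can run on the same executor.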

I also looked into `groupByKey()` for PairRDDs. What is the difference between the two, and how are their results partitioned and forwarded to executors?
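For comparison, a minimal `groupByKey` sketch (assumed sample data; requires the same SparkSession). Both operations hash-shuffle by key, but `repartition` only moves whole rows, while `groupByKey` additionally collects all values for a key into one in-memory `Iterable` on a single executor:

```scala
// Pairs of (account, value), mirroring two columns of the table above.
val rdd = spark.sparkContext.parallelize(Seq(
  ("a", 50), ("a", 40), ("b", 30), ("b", 40)
))

// Shuffles so that all values for a key sit in one partition and are
// materialized together as an Iterable on one executor.
val grouped = rdd.groupByKey() // RDD[(String, Iterable[Int])]

grouped.collect().foreach { case (acct, values) =>
  println(s"$acct -> ${values.toList}")
}
```

Note that materializing all values per key can be memory-heavy for skewed keys, which is why `reduceByKey` or `aggregateByKey` are usually preferred when a per-key aggregate is all you need.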

  • You could also have a look at [mapPartitions](https://spark.apache.org/docs/3.2.2/api/java/org/apache/spark/sql/Dataset.html#mapPartitions-org.apache.spark.api.java.function.MapPartitionsFunction-org.apache.spark.sql.Encoder-) – werner Mar 06 '23 at 21:51
