
Let's assume we have data gathered from one Kafka topic with 3 partitions. The Kafka keys and values are shown in the table below:

| key      |                        value                   |
|:---------|:----------------------------------------------:|
| "a-b-c"  | {"field1":"a", "fields2":"b", "field3":"c"}    |
| "d-e-f"  | {"field1":"d", "fields2":"e", "field3":"f"}    |
| "x-y-z"  | {"field1":"x", "fields2":"y", "field3":"z"}    |
...
...

And the Spark DataFrame will be the parsed version of the value column, like the table below (a parsing sketch follows the table):

| field1   | field2   |  field3   |
|:---------|:---------|:----------|
|    "a"   |   "b"    |    "c"    |
|    "d"   |   "e"    |    "f"    | 
|    "x"   |   "y"    |    "z"    |
...
...
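A minimal sketch of that extraction, assuming the value is read from Kafka into a DataFrame named `kafkaDf` (the DataFrame name and the schema below are just illustrations matching the sample JSON):

```scala
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

// Schema matching the JSON values shown in the first table
val valueSchema = new StructType()
  .add("field1", StringType)
  .add("field2", StringType)
  .add("field3", StringType)

// Parse the raw Kafka value into the three columns of the second table
val extracted = kafkaDf
  .select(from_json(col("value").cast("string"), valueSchema).as("v"))
  .select("v.field1", "v.field2", "v.field3")
```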

Data can be sent to Kafka with duplicates, so we need to get distinct values on the consumer side. By default, Spark Streaming (with the direct stream) creates as many Spark partitions as there are Kafka topic partitions, and one Spark executor processes the data of each partition (a 1:1 mapping between partitions and executors).
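For reference, a sketch of the direct stream setup assumed here (spark-streaming-kafka-0-10); the broker address, group id, and topic name are placeholders:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "consumer-group",
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// ssc is an existing StreamingContext; "topic" has 3 partitions as described above,
// so the resulting stream has 3 Spark partitions, one per Kafka partition
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Array("topic"), kafkaParams)
)
```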

We know the data may be duplicated, but it is always sent with a partitioning key, so it seems obvious that applying a distinct operation within each partition, without shuffling, would be enough.
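To illustrate the intended operation (not necessarily the configuration being asked about), a per-partition distinct could be sketched with `mapPartitions` on the stream above, which keeps records where they already are and does not trigger a shuffle, unlike `distinct` or `dropDuplicates`:

```scala
// Deduplicate within each partition only; no data is moved between partitions
val dedupedStream = stream.mapPartitions { records =>
  val seenKeys = scala.collection.mutable.HashSet.empty[String]
  records
    .filter(record => seenKeys.add(record.key))   // keep first occurrence of each Kafka key
    .map(record => (record.key, record.value))    // key/value pairs instead of ConsumerRecords
}
```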

Is it possible to set a configuration in Spark to avoid shuffling for this kind of operation?

