
repartition() creates partitions in memory and is used as a read() operation. partitionBy() creates partitions on disk and is used as a write operation.

  1. How can we confirm there are multiple files in memory while using repartition()?
  2. If repartition only creates partitions in memory, why does articles.repartition(1).write.saveAsTable("articles_table", format='orc', mode='overwrite') create only one file? And how is this different from partitionBy()?
  • What do you mean with "repartition is used as a read operation"? `repartition` is a method of the `Dataset` class that shuffles your data around to create a `Dataset` with a new number of partitions. It is not a read operation, nor used as a read operation. Also, `.repartition` can spill over to disk if the amount of data shuffled to the executors is large enough. Maybe you can share your source of information so we can see where the confusion comes from? – Koedlt Jul 13 '23 at 06:50
  • https://sparkbyexamples.com/pyspark/pyspark-repartition-vs-partitionby/#:~:text=repartition()%20creates%20a%20specified,memory%20partition%20and%20partition%20column. – Blue Clouds Jul 14 '23 at 05:10
  • 1) creating dataset is a read operation. partitionBy is a write operation to the disk (not repartition) – Blue Clouds Jul 14 '23 at 05:10

2 Answers


partitionBy indeed has an effect on how your files will look on disk, and indeed is used when writing a file (it is a method of the DataFrameWriter class).

That, however, does not mean that repartition has no effect on what will be written to disk.

Let's take the following example:

from pyspark.sql import SparkSession

# In the pyspark shell a SparkSession is already available as `spark`;
# otherwise create (or reuse) one:
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
  (1, 2, 3),
  (2, 2, 3),
  (3, 20, 300),
  (1, 24, 299),
  (5, 26, 312),
  (5, 28, 322),
  (5, 9, 2)
], ["colA", "colB", "colC"])

df.write.partitionBy("colA").parquet("using_partitionBy.parquet")
df.repartition(4).write.parquet("using_repartition.parquet")

Here we create a simple dataframe and write it out using two methods:

1) By using partitionBy

The output file structure on disk looks like this:

tree using_partitionBy.parquet/
using_partitionBy.parquet/
├── colA=1
│   ├── part-00000-17e47d08-2bbe-4ad6-809e-c466b396de4e.c000.snappy.parquet
│   └── part-00002-17e47d08-2bbe-4ad6-809e-c466b396de4e.c000.snappy.parquet
├── colA=2
│   └── part-00001-17e47d08-2bbe-4ad6-809e-c466b396de4e.c000.snappy.parquet
├── colA=3
│   └── part-00001-17e47d08-2bbe-4ad6-809e-c466b396de4e.c000.snappy.parquet
├── colA=5
│   ├── part-00002-17e47d08-2bbe-4ad6-809e-c466b396de4e.c000.snappy.parquet
│   └── part-00003-17e47d08-2bbe-4ad6-809e-c466b396de4e.c000.snappy.parquet
└── _SUCCESS

We see that this created 6 "subfiles" in 4 "subdirectories". Information about the data values (like colA=1) is actually stored on disk, in the directory names. This enables big performance improvements in subsequent queries that need to read this file. Imagine that you need to read all the values where colA=1: that becomes a trivial task, because the other subdirectories can simply be ignored.
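To see this read-side benefit in action, here is a minimal sketch (assuming the `spark` session and the `using_partitionBy.parquet` output from above): filtering on the partition column lets Spark prune the other subdirectories, which you should be able to verify in the physical plan.

pruned = spark.read.parquet("using_partitionBy.parquet").filter("colA = 1")
pruned.explain()  # the scan node should show a PartitionFilters entry on colA,
                  # meaning only the colA=1 subdirectory is listed and read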

2) By using repartition(4)

The output file structure on disk looks like this:

tree using_repartition.parquet/
using_repartition.parquet/
├── part-00000-6dafa06d-cea6-4af0-ba0d-9c761de0fd50-c000.snappy.parquet
├── part-00001-6dafa06d-cea6-4af0-ba0d-9c761de0fd50-c000.snappy.parquet
├── part-00002-6dafa06d-cea6-4af0-ba0d-9c761de0fd50-c000.snappy.parquet
├── part-00003-6dafa06d-cea6-4af0-ba0d-9c761de0fd50-c000.snappy.parquet
└── _SUCCESS

We see that 4 "subfiles" were created and NO "subdirectories" were made. These "subfiles" actually represent your partitions inside of Spark. Since you typically work with very large data in Spark, all your data has to be partitioned in some way.

Each partition will be processed by 1 task, which can be taken up by 1 core of your cluster. Once a core has picked up a task and done all the necessary processing, it writes the output to disk as one of these "subfiles". When it has finished writing that "subfile", the core is ready to take on another partition.
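This also gives a way to answer question 1 from the post: you can ask a DataFrame how many in-memory partitions it currently has. A quick sketch, reusing the `df` from above:

df.rdd.getNumPartitions()                 # however many partitions df currently has
df.repartition(4).rdd.getNumPartitions()  # 4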

When to use partitionBy and repartition

This is a bit opinionated and surely not exhaustive, but might give you some insight into what to use.

partitionBy and repartition can be used for different goals:

  • Use partitionBy if:
    • You want to write data to disk in a way that gives you big performance benefits when reading it back. This is mostly useful when you have a column that you filter on a lot and whose cardinality is not too high.
  • Use repartition if:
    • You want to tune the size of your partitions to your cluster size, to improve the performance of your jobs
    • You want to write out files with a sensible partition size, but using partitionBy on any column would result in way too high a cardinality (imagine time series data from sensors); a sketch of how the two can be combined follows this list
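The two can also be combined. As a rough sketch (a common pattern, not something from the example above): repartitioning on the partition column before writing means each colA value lives in a single in-memory partition, so each colA=... directory typically ends up with one file instead of several.

df.repartition("colA") \
  .write \
  .partitionBy("colA") \
  .mode("overwrite") \
  .parquet("combined.parquet")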
Koedlt
  • Such a clear explanation! Some questions still: so repartition = read operation is not correct? Both repartition and partitionBy are write operations, correct? Also, partitioning is always done on the file system, and not in memory, agree? – Blue Clouds Jul 17 '23 at 09:31
  • 1) `repartition` == read operation is not correct. 2) `partitionBy` is indeed something you do when writing a file. `repartition` is not only done when writing a file; you can also do this on an intermediate dataset. It is a shuffle operation that happens mostly in memory, until your cluster runs out of memory and it spills over to disk. 3) Partitions are an integral part of the whole discussion. Since typically 1 partition = 1 task, 1 partition will be read into memory by your task. – Koedlt Jul 17 '23 at 18:27
  1. You mean how to confirm that when you do, for example, .repartition(100) you will get 100 files in the output? I was checking it in the Spark UI: number of tasks = number of partitions = number of written files.

  2. With .repartition(1) you are moving the whole dataset to one partition, which will be processed as 1 task by one core and written to one file. There is no way to process a single task in parallel, so Spark has no choice but to store everything in one file.
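A quick way to confirm this on a local filesystem (a sketch, assuming a DataFrame `df` and a local path rather than the saveAsTable call from the question):

import glob

df.repartition(1).write.mode("overwrite").parquet("one_partition.parquet")
print(glob.glob("one_partition.parquet/part-*"))  # exactly one part-* data file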

M_S
  • #1 (not on files, but in memory) – Blue Clouds Jul 13 '23 at 06:34
  • #2 how is that different from partitionBy()? – Blue Clouds Jul 13 '23 at 06:35
  • #2 I am not sure if I understand your question. Those two are different functions; you can't use partitionBy in that way, as it always partitions by expression and you can pass only columns, not a number of partitions (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.partitionBy.html) – M_S Jul 13 '23 at 06:53