1

I have a program that reads a CSV file from the local filesystem. Spark (run in local mode) is actually using all 16 cores of the instance, so I have 16 tasks running in parallel.

Now, what I want to do is tune its performance when reading the file.

When checking the Spark UI, I found that each task reads 128 MB of the file as its input size (the default Hadoop block size). Since the instance has 120 GB of RAM, I would like to increase the input size per task.

What configuration should I set to do so?

philantrovert
Hela Chikhaoui
  • Do you intend to change the block size for the entire cluster, or do you need to change it only for your job? Or do you want to use fewer tasks for your job? – Deepan Ram Apr 12 '18 at 13:25
  • Fewer tasks in total, but I want to keep 16 tasks running in parallel while leveraging a larger block size for faster processing :) – Hela Chikhaoui Apr 12 '18 at 13:30

2 Answers

-1

You can try changing the block size value by setting the following property in hdfs-site.xml:

<property>
    <name>dfs.block.size</name>
    <value>134217728</value>
    <description>Block size</description>
</property>
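If editing hdfs-site.xml is not practical (for example, when Spark runs in local mode against the local filesystem), a rough programmatic equivalent is to set the same property on the Hadoop configuration that Spark hands to its input formats, before the file is read. The following is a minimal, untested sketch assuming a spark-shell style SparkSession named spark and a placeholder file path; note that the value must be given in bytes, and whether the block size is honoured when reading from the local filesystem rather than HDFS may vary:

import org.apache.spark.sql.SparkSession

// Hypothetical local-mode session; adjust the master and memory settings to your instance.
val spark = SparkSession.builder()
  .appName("block-size-sketch")
  .master("local[16]")
  .getOrCreate()

// Set the same property on the Hadoop configuration Spark passes to its input formats.
// The value is in bytes: 268435456 bytes = 256 MB.
spark.sparkContext.hadoopConfiguration.set("dfs.block.size", "268435456")

// Read the file only after the configuration has been set.
val lines = spark.sparkContext.textFile("/path/to/file.csv")
println(lines.getNumPartitions)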
Pawan Mishra
  • I tried to do this programmatically with spark.sparkContext.set("dfs.block.size","256"), but it didn't work – Hela Chikhaoui Apr 12 '18 at 13:30
  • If you are interested in increasing the parallelism, then you should increase the number of executors. In Spark, partitions are assigned to executors, which in turn get mapped to tasks. How many executors do you see running in the Spark UI? – Pawan Mishra Apr 12 '18 at 13:45
  • Since you have 16 cores available on your machine, you can spin up 3 executors with 4 cores each, each having 20g of memory. This is just a rough calculation; you can tune the numbers as per your application's performance. – Pawan Mishra Apr 12 '18 at 13:47
  • It's in local mode, so there's only one JVM process running: executor = driver, and the driver/executor will use all cores, so configuring the executors won't be effective – Hela Chikhaoui Apr 12 '18 at 14:25
-1

There are two options you can try:

1) Reduce the number of tasks while reading:

val file = sc.textFile("/path/to/file.txt.gz", <smaller number of partitions>)

2) If you want to set a higher block size:

conf.set("dfs.block.size", "128m")

You can also try setting:

mapreduce.input.fileinputformat.split.minsize
mapreduce.input.fileinputformat.split.maxsize
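For instance, a minimal sketch assuming an existing SparkContext named sc (as in option 1) and a hypothetical 256 MB target split size; both properties take values in bytes:

// Raise the minimum (and maximum) split size so each task reads a larger chunk.
// 268435456 bytes = 256 MB; tune this to your file size and memory budget.
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.minsize", "268435456")
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize", "268435456")

// Read after the properties are set; this should yield fewer, larger partitions
// than the 128 MB default.
val file = sc.textFile("/path/to/file.csv")
println(file.getNumPartitions)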

Deepan Ram