1

I have a program that reads a CSV file from the local filesystem. Spark (run in local mode) is actually using all 16 cores of the instance, so I have 16 tasks running in parallel.

Now, what I want to do is tune its performance when reading the file.

When checking the Spark UI, I found that each task reads 128 MB of the file as its input size (the default Hadoop block size). Since the instance has 120 GB of RAM, I would like to increase the input size per task.

What configuration should I set to do so?

philantrovert
Hela Chikhaoui
  • Do you intend to change the block size for the entire cluster, or do you need to change it only for your job? Or do you want to use fewer tasks for your job? – Deepan Ram Apr 12 '18 at 13:25
  • Fewer tasks in total, but I want to keep 16 tasks running in parallel while leveraging a larger block size for faster processing :) – Hela Chikhaoui Apr 12 '18 at 13:30

2 Answers

-1

You can try changing the block size value by setting the following property in hdfs-site.xml:

<property>
    <name>dfs.block.size</name>
    <value>134217728</value>
    <description>Block size</description>
</property>
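If editing hdfs-site.xml is not practical (for example, when Spark runs in local mode against the local filesystem), a rough programmatic equivalent is to set the same property on the Hadoop configuration that Spark hands to its input formats, before the file is read. The following is a minimal, untested sketch assuming a spark-shell style SparkSession named spark and a placeholder file path; note that the value must be given in bytes, and whether the block size is honoured when reading from the local filesystem rather than HDFS may vary:

import org.apache.spark.sql.SparkSession

// Hypothetical local-mode session; adjust the master and memory settings to your instance.
val spark = SparkSession.builder()
  .appName("block-size-sketch")
  .master("local[16]")
  .getOrCreate()

// Set the same property on the Hadoop configuration Spark passes to its input formats.
// The value is in bytes: 268435456 bytes = 256 MB.
spark.sparkContext.hadoopConfiguration.set("dfs.block.size", "268435456")

// Read the file only after the configuration has been set.
val lines = spark.sparkContext.textFile("/path/to/file.csv")
println(lines.getNumPartitions)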
Pawan Mishra
  • I tried to do this programmatically with spark.sparkContext.set("dfs.block.size","256"), but it didn't work – Hela Chikhaoui Apr 12 '18 at 13:30
  • If you are interested in increasing the parallelism, then you should increase the number of executors. In Spark, partitions are assigned to executors, which in turn get mapped to tasks. How many executors do you see running in the Spark UI? – Pawan Mishra Apr 12 '18 at 13:45
  • Since you have 16 cores available on your machine, you can spin up 3 executors with 4 cores each, each having 20g of memory. This is just a rough calculation; you can tune the numbers as per your application's performance. – Pawan Mishra Apr 12 '18 at 13:47
  • It's in local mode, so there's only one JVM process running: executor = driver, and the driver/executor will use all cores, so configuring the executors won't be effective – Hela Chikhaoui Apr 12 '18 at 14:25
-1

There are two options you can try:

1) Reduce the number of tasks while reading:

val file = sc.textFile("/path/to/file.txt.gz", <smaller number of partitions>)

2) If you want to set a higher block size:

conf.set("dfs.block.size", "128m")

You can also try setting:

mapreduce.input.fileinputformat.split.minsize
mapreduce.input.fileinputformat.split.maxsize
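For instance, a minimal sketch assuming an existing SparkContext named sc (as in option 1) and a hypothetical 256 MB target split size; both properties take values in bytes:

// Raise the minimum (and maximum) split size so each task reads a larger chunk.
// 268435456 bytes = 256 MB; tune this to your file size and memory budget.
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.minsize", "268435456")
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize", "268435456")

// Read after the properties are set; this should yield fewer, larger partitions
// than the 128 MB default.
val file = sc.textFile("/path/to/file.csv")
println(file.getNumPartitions)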

Deepan Ram