
I can't understand why my DataFrame ends up on only one node. I have a Spark standalone cluster of 14 machines with 4 physical CPUs each.

I am connected through a notebook and create my Spark context there.

I expect a parallelism of 8 partitions, but when I create a DataFrame I get only one partition.
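One way to see what Spark actually did is to compare the configured parallelism with the partition count of the loaded DataFrame. This is a sketch, assuming an existing `SparkSession` named `spark` and a hypothetical file path:

```python
# Assumption: a SparkSession named `spark` already exists (as in a notebook).
df = spark.read.csv("hdfs:///data/input.csv.gz", header=True)  # hypothetical path

# defaultParallelism reflects the cores Spark thinks it has available;
# getNumPartitions shows how the input was actually split.
print(spark.sparkContext.defaultParallelism)
print(df.rdd.getNumPartitions())   # typically 1 for a single gzipped file
```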

What am I missing?

Thanks to the answer from user8371915, I repartitioned my DataFrame (I was reading a compressed file (.csv.gz), so I understand it is not splittable).
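The repartitioning step can be sketched as follows (again assuming a `SparkSession` named `spark` and a hypothetical path; the target of 10 partitions is only an example):

```python
# Assumption: a SparkSession named `spark` already exists.
df = spark.read.csv("hdfs:///data/input.csv.gz", header=True)  # hypothetical path

# A gzipped file arrives as a single partition; spread it explicitly
# so later stages can run across the cluster.
df = df.repartition(10)
print(df.rdd.getNumPartitions())   # now 10
```

Note that the initial read is still a single task; only the stages after the repartition shuffle can use all 10 partitions.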

But when I do a count on it, I see it being executed on only one executor, namely executor n°1, even though the file is 700 MB and spans 6 blocks on HDFS. As far as I understand, the computation should run over 10 cores across 5 nodes... But everything is computed on only one node :-(
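The 6 blocks are consistent with a default HDFS block size of 128 MB (an assumption; clusters can be configured differently). A quick arithmetic sketch of why the block count and the partition count disagree for gzip:

```python
import math

FILE_SIZE_MB = 700     # size reported above
HDFS_BLOCK_MB = 128    # common HDFS default; your cluster may differ

# An uncompressed file can be split once per HDFS block...
blocks = math.ceil(FILE_SIZE_MB / HDFS_BLOCK_MB)
print(blocks)          # 6, matching the 6 blocks seen on HDFS

# ...but a gzip stream has no internal split points, so Spark must
# read the whole file in a single task, regardless of block count.
partitions_for_gzip = 1
print(partitions_for_gzip)
```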

Romain Jouin

1 Answer


There are two possibilities:

  • the input is small, so Spark puts it into a single partition;
  • the file is compressed with a non-splittable codec (like gzip), so it cannot be divided into splits.

In the first case you may consider adjusting the input-split parameters, but if the file fits in one partition with the defaults, it is simply small.

In the second case it is best to unpack the file before loading it into Spark. If you cannot do that, repartition after loading, but it will be slow.
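The "unpack first" route can be sketched with standard HDFS and shell tools; the paths below are hypothetical and should be adapted to your cluster:

```shell
# Hypothetical paths -- adapt to your cluster layout.
hdfs dfs -get /data/input.csv.gz .   # fetch the compressed file locally
gunzip input.csv.gz                  # decompress -> input.csv
hdfs dfs -put input.csv /data/       # plain CSV now splits per HDFS block
```

After this, reading `/data/input.csv` should yield one partition per HDFS block (6 for a 700 MB file at a 128 MB block size), with no shuffle needed.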

Alper t. Turker