
I can't understand why my DataFrame ends up on only one node. I have a Spark standalone cluster of 14 machines with 4 physical CPUs each.

I am connected through a notebook and create my Spark context there.

I expect a parallelism of 8 partitions, but when I create a DataFrame I get only one partition.
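One way to see what Spark actually did is to compare the configured parallelism with the partition count of the loaded DataFrame. This is a sketch, assuming an existing `SparkSession` named `spark` and a hypothetical file path:

```python
# Assumption: a SparkSession named `spark` already exists (as in a notebook).
df = spark.read.csv("hdfs:///data/input.csv.gz", header=True)  # hypothetical path

# defaultParallelism reflects the cores Spark thinks it has available;
# getNumPartitions shows how the input was actually split.
print(spark.sparkContext.defaultParallelism)
print(df.rdd.getNumPartitions())   # typically 1 for a single gzipped file
```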

What am I missing?

Thanks to the answer from user8371915, I repartitioned my DataFrame (I was reading a compressed file (.csv.gz), so I understand it is not splittable).
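The repartitioning step can be sketched as follows (again assuming a `SparkSession` named `spark` and a hypothetical path; the target of 10 partitions is only an example):

```python
# Assumption: a SparkSession named `spark` already exists.
df = spark.read.csv("hdfs:///data/input.csv.gz", header=True)  # hypothetical path

# A gzipped file arrives as a single partition; spread it explicitly
# so later stages can run across the cluster.
df = df.repartition(10)
print(df.rdd.getNumPartitions())   # now 10
```

Note that the initial read is still a single task; only the stages after the repartition shuffle can use all 10 partitions.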

But when I do a count on it, I see it being executed on only one executor, namely executor n°1, even though the file is 700 MB and spans 6 blocks on HDFS. As far as I understand, the computation should run over 10 cores across 5 nodes... But everything is computed on only one node :-(
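The 6 blocks are consistent with a default HDFS block size of 128 MB (an assumption; clusters can be configured differently). A quick arithmetic sketch of why the block count and the partition count disagree for gzip:

```python
import math

FILE_SIZE_MB = 700     # size reported above
HDFS_BLOCK_MB = 128    # common HDFS default; your cluster may differ

# An uncompressed file can be split once per HDFS block...
blocks = math.ceil(FILE_SIZE_MB / HDFS_BLOCK_MB)
print(blocks)          # 6, matching the 6 blocks seen on HDFS

# ...but a gzip stream has no internal split points, so Spark must
# read the whole file in a single task, regardless of block count.
partitions_for_gzip = 1
print(partitions_for_gzip)
```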

Romain Jouin

1 Answer


There are two possibilities:

  • the input is small, so Spark puts it into a single partition;
  • the file is compressed with a non-splittable codec (like gzip), so it cannot be divided into splits.

In the first case you may consider adjusting the input-split parameters, but if the file fits in one partition with the defaults, it is simply small.

In the second case it is best to unpack the file before loading it into Spark. If you cannot do that, repartition after loading, but it will be slow.
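The "unpack first" route can be sketched with standard HDFS and shell tools; the paths below are hypothetical and should be adapted to your cluster:

```shell
# Hypothetical paths -- adapt to your cluster layout.
hdfs dfs -get /data/input.csv.gz .   # fetch the compressed file locally
gunzip input.csv.gz                  # decompress -> input.csv
hdfs dfs -put input.csv /data/       # plain CSV now splits per HDFS block
```

After this, reading `/data/input.csv` should yield one partition per HDFS block (6 for a 700 MB file at a 128 MB block size), with no shuffle needed.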

Alper t. Turker