
Can anyone please explain how Spark DataFrames are better than Pandas DataFrames in terms of execution time? I'm dealing with data of moderate volume and making Python-function-powered transformations.

For example, I have a column with numbers from 1 to 100,000 in my dataset and want to perform a basic numeric operation: creating a new column which is the cube of the existing numeric column.

from datetime import datetime
import numpy as np
import pandas as pd

def cube(num):
    return num**3

array_of_nums = np.arange(1, 100001)  # numbers 1 to 100,000, as described above

dataset = pd.DataFrame(array_of_nums, columns=["numbers"])

start_time = datetime.now() 
# Some complex transformations...
dataset["cubed"] = [cube(x) for x in dataset.numbers]
end_time = datetime.now() 

print("Time taken :", (end_time-start_time))

The output is

Time taken : 0:00:00.109349

If I use a Spark DataFrame with 10 worker nodes, can I expect the following result (which is 1/10th of the time taken by the Pandas DataFrame)?

Time taken : 0:00:00.010935
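
For context, this is roughly how I picture the equivalent transformation in Spark (just a sketch I have not benchmarked; the session setup and the use of a built-in column expression instead of my cube function are assumptions on my part):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cube-example").getOrCreate()

# The same 100,000 numbers, but as a distributed Spark DataFrame
spark_df = spark.range(1, 100001).withColumnRenamed("id", "numbers")

# Built-in column expression; a Python UDF would add serialization overhead
# between the JVM and the Python workers for every row.
spark_df = spark_df.withColumn("cubed", F.pow("numbers", 3))

spark_df.show(5)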
  • There's a lot that goes into the performance of Spark, including the HTTP overhead of communication between forked processes, so I wouldn't expect exactly a 10x improvement in speed – C.Nivs Apr 30 '19 at 00:52
  • This could help give an idea of Spark/Pandas performance (it uses local Spark, but most of it applies to clusters too): https://stackoverflow.com/questions/48815341/why-is-apache-spark-python-so-slow-locally-as-compared-to-pandas – Shaido Apr 30 '19 at 01:14
  • Spark is used to process massive amounts of data (terabytes, petabytes) ... if you can run your logic on a single machine without performance problems, then there's really no point in using Spark, because you would just be wasting resources trying to distribute something that doesn't need to be distributed. – thePurplePython Apr 30 '19 at 01:33
  • Possible duplicate of [Why is Apache-Spark - Python so slow locally as compared to pandas?](https://stackoverflow.com/questions/48815341/why-is-apache-spark-python-so-slow-locally-as-compared-to-pandas) – user10938362 Apr 30 '19 at 10:47

1 Answer


1) A Pandas DataFrame is not distributed, whereas a Spark DataFrame is. -> Hence you won't get the benefit of parallel processing with a Pandas DataFrame, and processing speed in Pandas will drop off for large amounts of data.
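
As a rough illustration (a sketch only; the partition count and session setup are placeholders, your cluster defaults will differ), a Spark DataFrame is split into partitions that the executors can process in parallel, while a Pandas DataFrame always lives in the memory of a single process:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

# 100,000 rows explicitly split into 10 partitions, so up to 10 tasks
# can run at the same time across the available executors.
df = spark.range(1, 100001).repartition(10)

print(df.rdd.getNumPartitions())  # 10 -- each partition is a unit of parallel work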

2) A Spark DataFrame assures fault tolerance (it's resilient), whereas a Pandas DataFrame does not. -> Hence if your data processing is interrupted or fails partway through, Spark can regenerate the failed result set from the lineage (from the DAG). Fault tolerance is not supported in Pandas; you need to implement your own framework to assure it.
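
You can inspect that lineage yourself (a sketch; the transformation below is an arbitrary example): Spark keeps the plan used to build a DataFrame, and a lost partition of the result can be recomputed by re-running the relevant part of that plan.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()

df = spark.range(1, 100001).withColumn("cubed", F.pow("id", 3))

# Prints the logical and physical plans Spark retains for this DataFrame;
# this is the information used to recompute lost partitions.
df.explain(True)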

  • Thanks for the answer. Can you please comment on the performance of Spark DataFrames versus Pandas DataFrames? How correct is the assumption mentioned in my question? – komandurikc Apr 30 '19 at 12:21
  • It depends on many parameters, like the number of partitions in the given data, the number of cores available on each executor, the number of executors available, the amount of memory available to each executor, the type of scheduler (FAIR/capacity, etc.) used in the resource manager (YARN/standalone/Mesos, etc.), and many other cluster configuration parameters. The time required to complete a Spark job is not directly proportional to the number of nodes in the cluster. If you're already getting results in milliseconds then it's probably better to stick to Python. If the process takes hours (> 1 or 2 hr) to complete then go for Spark. – MIKHIL NAGARALE Apr 30 '19 at 12:53
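
As a rough illustration of the kind of parameters mentioned in the last comment (a sketch; the values are placeholders, not tuning recommendations), these knobs are typically set when the Spark session or job is configured:

from pyspark.sql import SparkSession

# Hypothetical settings -- adjust to your own cluster; these numbers are
# placeholders, not recommendations.
spark = (
    SparkSession.builder
    .appName("tuning-demo")
    .config("spark.executor.instances", "10")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "8g")
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()
)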