Can anyone please explain how Spark DataFrames are faster, in terms of execution time, than Pandas DataFrames? I'm dealing with a moderate volume of data and applying transformations powered by Python functions.
For example, my dataset has a column with the numbers 1 to 100,000, and I want to perform a basic numeric operation: creating a new column that is the cube of the existing numeric column.
from datetime import datetime
import numpy as np
import pandas as pd

def cube(num):
    return num**3

# 100,000 values: 1 through 100,000 inclusive
array_of_nums = np.arange(1, 100001)
dataset = pd.DataFrame(array_of_nums, columns=["numbers"])

start_time = datetime.now()
# Some complex transformations...
# Python-level loop: cube() is called once per row
dataset["cubed"] = [cube(x) for x in dataset.numbers]
end_time = datetime.now()
print("Time taken :", (end_time - start_time))
The output is
Time taken : 0:00:00.109349
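(I know the same column could also be computed with a vectorized expression instead of a Python-level loop; the one-liner below is just for reference, since my real transformations need arbitrary Python functions.)

# Vectorized alternative: the arithmetic runs in NumPy's compiled code
dataset["cubed"] = dataset["numbers"] ** 3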
If I use a Spark DataFrame with 10 worker nodes, can I expect the following result, i.e. one tenth of the time taken by the Pandas DataFrame? A sketch of what I assume the equivalent Spark code looks like follows the expected output below.
Time taken : 0:00:00.010935
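For reference, here is a sketch of what I assume the equivalent PySpark code would look like (the session setup and app name are my own guesses, not something I've run on a real cluster):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("cube-example").getOrCreate()

# spark.range produces a single-column DataFrame named "id"
sdf = spark.range(1, 100001).withColumnRenamed("id", "numbers")

# Column arithmetic is executed by Spark's engine, not a Python loop;
# nothing actually runs until an action such as show() or count()
sdf = sdf.withColumn("cubed", col("numbers") ** 3)
sdf.show(5)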