
Here are several ways of creating a union of DataFrames. Which (if any) is best/recommended when we are talking about big DataFrames? Should I create an empty DataFrame first, or continuously union to the first DataFrame created?

Empty Dataframe creation

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("A", StringType(), False), 
    StructField("B", StringType(), False), 
    StructField("C", StringType(), False)
])

pred_union_df = spark_context.parallelize([]).toDF(schema)

Method 1 - Union as you go:

for ind in indications:
    fitted_model = get_fitted_model(pipeline, train_balanced_df, ind)
    pred = get_predictions(fitted_model, pred_output_df, ind)
    pred_union_df = pred_union_df.union(pred[['A', 'B', 'C']])

Method 2 - Union at the end:

all_pred = []
for ind in indications:
    fitted_model = get_fitted_model(pipeline, train_balanced_df, ind)
    pred = get_predictions(fitted_model, pred_output_df, ind)
    all_pred.append(pred)
pred_union_df = pred_union_df.union(all_pred)

Or do I have it all wrong?

Edit: Method 2 was not possible as I thought it would be from this answer: `DataFrame.union` takes a single DataFrame, not a list. I had to loop through the list and union each DataFrame.
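For what it's worth, that final loop can be written as a single fold with `functools.reduce` — with Spark the call would be `reduce(DataFrame.union, all_pred)`. A minimal sketch of the same fold pattern, using plain Python lists as a stand-in for DataFrames (the `union` helper and `all_pred` contents here are hypothetical):

```python
from functools import reduce

# Stand-in combine step: with Spark DataFrames this binary operation
# would be DataFrame.union, i.e. reduce(DataFrame.union, all_pred).
def union(a, b):
    return a + b

# One "DataFrame" of predictions per indication (illustrative data).
all_pred = [[("A1", "B1", "C1")], [("A2", "B2", "C2")], [("A3", "B3", "C3")]]

# Fold the whole list into one union, left to right.
pred_union_df = reduce(union, all_pred)
print(pred_union_df)  # [('A1', 'B1', 'C1'), ('A2', 'B2', 'C2'), ('A3', 'B3', 'C3')]
```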

Pouya Barrach-Yousefi

1 Answer


Method 2 is always preferred, since it avoids the long-lineage issue that comes from repeatedly unioning as you go.

Although DataFrame.union only takes one DataFrame as an argument, SparkContext.union does take a list of RDDs. Given your sample code, you could union the RDDs before calling toDF.

If your data is on disk, you could also try to load them all at once to achieve union, e.g.,

dataframe = spark.read.csv([path1, path2, path3])
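The idea is that a single read over many files already produces one combined dataset, with no union step at all. A plain-Python illustration of the same "read many files as one dataset" pattern, runnable without Spark (the temp directory and sample rows are made up for the demo):

```python
import csv
import glob
import os
import tempfile

# Write two small CSV part-files, standing in for path1, path2, ...
tmp = tempfile.mkdtemp()
for i, rows in enumerate([[("a", "b", "c")], [("d", "e", "f")]]):
    with open(os.path.join(tmp, f"part{i}.csv"), "w", newline="") as f:
        csv.writer(f).writerows(rows)

# Read every file in one pass into a single combined dataset,
# the way spark.read.csv([path1, path2, path3]) yields one DataFrame.
combined = []
for path in sorted(glob.glob(os.path.join(tmp, "*.csv"))):
    with open(path, newline="") as f:
        combined.extend(tuple(r) for r in csv.reader(f))

print(combined)  # [('a', 'b', 'c'), ('d', 'e', 'f')]
```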

shuaiyuancn