Here are several ways of creating a union of DataFrames. Which (if any) is best/recommended when we are talking about big DataFrames? Should I create an empty DataFrame first, or continuously union to the first DataFrame created?
Empty DataFrame creation:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
    StructField("A", StringType(), False),
    StructField("B", StringType(), False),
    StructField("C", StringType(), False)
])
pred_union_df = spark_context.parallelize([]).toDF(schema)
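Side note: I believe the same empty DataFrame can be created without going through an empty RDD, by passing the schema straight to the session (a sketch, assuming a SparkSession named spark):

pred_union_df = spark.createDataFrame([], schema)  # empty list + explicit schema, no RDD needed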
Method 1 - Union as you go:
for ind in indications:
    fitted_model = get_fitted_model(pipeline, train_balanced_df, ind)
    pred = get_predictions(fitted_model, pred_output_df, ind)
    pred_union_df = pred_union_df.union(pred[['A', 'B', 'C']])
Method 2 - Union at the end:
all_pred = []
for ind in indications:
    fitted_model = get_fitted_model(pipeline, train_balanced_df, ind)
    pred = get_predictions(fitted_model, pred_output_df, ind)
    all_pred.append(pred)
pred_union_df = pred_union_df.union(all_pred)
Or do I have it all wrong?
Edit: Method 2 was not possible as I thought it would be from this answer: union takes a single DataFrame, not a list, so I had to loop through the list and union each DataFrame individually.
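For reference, a compact version of that loop, using functools.reduce to union the collected DataFrames pairwise (a sketch; note it also makes the empty starter DataFrame unnecessary):

from functools import reduce
from pyspark.sql import DataFrame

# Equivalent to looping over all_pred and unioning one DataFrame at a time
pred_union_df = reduce(DataFrame.union, all_pred)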