Can someone explain why we need transform & transform_df methods separately?

sumitkanoje
2 Answers
There's a small difference between the @transform and @transform_df decorators in Code Repositories:

- @transform_df operates exclusively on DataFrame objects.
- @transform operates on transforms.api.TransformInput and transforms.api.TransformOutput objects rather than DataFrames.

If your data transformation depends exclusively on DataFrame objects, you can use the @transform_df() decorator. This decorator injects DataFrame objects and expects the compute function to return a DataFrame. Alternatively, you can use the more general @transform() decorator and explicitly call the dataframe() method to access a DataFrame containing your input dataset.
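
For illustration, here is a minimal sketch of that second approach. The dataset paths and the is_active column are placeholders, not anything from the original question: the TransformInput is converted to a DataFrame with dataframe(), and the result is written back explicitly with write_dataframe().

from transforms.api import transform, Input, Output

@transform(
    my_output=Output("/examples/output_dataset"),  # placeholder path
    my_input=Input("/examples/input_dataset"),     # placeholder path
)
def compute(my_input, my_output):
    # dataframe() turns the TransformInput into a pyspark DataFrame
    df = my_input.dataframe()
    # hypothetical filter, standing in for real transformation logic
    df = df.filter(df["is_active"] == True)
    # with @transform, the output must be written explicitly
    my_output.write_dataframe(df)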

Adil B
One addition to Adil B's answer: @transform_df can handle only one output, whereas @transform can have multiple, but you are in charge of writing each output yourself:
from pyspark.sql import DataFrame
from transforms.api import transform_df, Input, Output

@transform_df(
    Output("some_foundry_id"),
    input_dataset=Input("another_foundry_id"),
)
def compute(input_dataset: DataFrame) -> DataFrame:
    # the returned DataFrame is written to the Output declared above
    return input_dataset
The DataFrame you return here is saved by Palantir to the output dataset.
from transforms.api import transform, Input, Output, TransformInput, TransformOutput

@transform(
    input_1=Input("..."),
    output_1=Output("..."),
    output_2=Output("..."),
)
def compute(input_1: TransformInput, output_1: TransformOutput, output_2: TransformOutput) -> None:
    # with @transform you call write_dataframe() on each output yourself
    output_1.write_dataframe(input_1.dataframe())
    output_2.write_dataframe(input_1.dataframe())
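
Multiple outputs are handy when one input should feed several datasets. As a hedged sketch, keeping the placeholder dataset ids and inventing a status column purely for illustration, you could split one input into two outputs like this:

from transforms.api import transform, Input, Output

@transform(
    source=Input("..."),
    passed=Output("..."),
    failed=Output("..."),
)
def compute(source, passed, failed):
    df = source.dataframe()
    # "status" is a made-up column, used only to show the split
    passed.write_dataframe(df.filter(df["status"] == "PASS"))
    failed.write_dataframe(df.filter(df["status"] != "PASS"))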

Grigory Sharkov