Can someone explain why we need transform & transform_df methods separately?

sumitkanoje

2 Answers

There's a small difference between the @transform and @transform_df decorators in Code Repositories:

  • @transform_df operates exclusively on DataFrame objects.
  • @transform operates on transforms.api.TransformInput and transforms.api.TransformOutput objects rather than DataFrames.

If your data transformation depends exclusively on DataFrame objects, you can use the @transform_df decorator. This decorator injects DataFrame objects into your compute function and expects it to return a DataFrame, which is then written to the output for you.

Alternatively, you can use the more general @transform decorator and explicitly call the dataframe() method on each input to access a DataFrame containing your input dataset.
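For illustration, here is a minimal sketch of that pattern (the dataset paths are placeholders, not real Foundry identifiers):

from transforms.api import transform, Input, Output

@transform(
    my_output=Output("/path/to/output"),  # placeholder dataset path
    my_input=Input("/path/to/input"),     # placeholder dataset path
)
def compute(my_output, my_input):
    # my_input is a TransformInput; .dataframe() returns a PySpark DataFrame
    df = my_input.dataframe()
    # Nothing is returned; the result must be written out explicitly
    my_output.write_dataframe(df)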

Adil B

One addition to @Adil B's answer: @transform_df can handle only one output, whereas @transform can have multiple outputs, but you are then in charge of writing each output yourself:

from pyspark.sql import DataFrame
from transforms.api import transform_df, Input, Output

@transform_df(
    Output("some_foundry_id"),
    input_dataset=Input("another_foundy_id"),
)
def compute(input_dataset: DataFrame) -> DataFrame:
    return input_dataset

The DataFrame you return here is saved by Foundry to the output dataset:

from transforms.api import transform, Input, Output, TransformInput, TransformOutput

@transform(
    input_1=Input("..."),
    output_1=Output("..."),
    output_2=Output("..."),
)
def compute(input_1: TransformInput, output_1: TransformOutput, output_2: TransformOutput) -> None:
    # With @transform nothing is saved automatically: read the input as a
    # DataFrame and write it to each output explicitly.
    output_1.write_dataframe(input_1.dataframe())
    output_2.write_dataframe(input_1.dataframe())
Grigory Sharkov