The sample DataFrame from Florian's code:
+-----------+-----------+-----------+
|ball_column|   keep_the|hall_column|
+-----------+-----------+-----------+
|          0|          7|         14|
|          1|          8|         15|
|          2|          9|         16|
|          3|         10|         17|
|          4|         11|         18|
|          5|         12|         19|
|          6|         13|         20|
+-----------+-----------+-----------+
The first part of the code drops every column whose name contains a word from the banned list:
#first part of the code
banned_list = ["ball","fall","hall"]
condition = lambda col: any(word in col for word in banned_list)
new_df = df.drop(*filter(condition, df.columns))
So the above piece of code should drop ball_column and hall_column.
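As a quick sanity check, the same predicate can be tried on a plain Python list of column names, without a Spark session. This is only a sketch: the `columns` list stands in for `df.columns`, and `is_banned`, `to_drop` and `to_keep` are names made up for this illustration.

```python
# Plain-Python check of the column filter; no Spark session is needed.
banned_list = ["ball", "fall", "hall"]

# Stand-in for df.columns (assumed to match the sample DataFrame above).
columns = ["ball_column", "keep_the", "hall_column"]


def is_banned(col):
    """Return True if the column name contains any banned word."""
    return any(word in col for word in banned_list)


to_drop = [c for c in columns if is_banned(c)]
to_keep = [c for c in columns if not is_banned(c)]

print("dropped:", to_drop)
print("kept:", to_keep)
```

Note that the match is a substring test, so a column such as `football_score` would also be dropped.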
The second part of the code buckets specific columns in the list. For this example, we will bucket the only one remaining, keep_the.
from pyspark.ml.feature import Bucketizer

bagging = Bucketizer(
    splits=[-float("inf"), 10, 100, float("inf")],
    inputCol='keep_the',
    outputCol='keep_the')
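To make the effect of these splits concrete, here is a small plain-Python sketch of the bucket assignment. It assumes the documented Bucketizer rule that a bucket defined by splits x, y holds values in [x, y), and it only handles finite values. `bucket_index` is a hypothetical helper, not part of the Spark API, and Spark itself returns the bucket index as a double.

```python
from bisect import bisect_right

# Same split points as in the Bucketizer above.
splits = [-float("inf"), 10, 100, float("inf")]


def bucket_index(value):
    """Index of the half-open bucket [splits[i], splits[i + 1]) containing value.

    Hypothetical helper for illustration; valid for finite values only.
    """
    return bisect_right(splits, value) - 1


# Values of the keep_the column in the sample DataFrame.
values = list(range(7, 14))
buckets = [bucket_index(v) for v in values]

for v, b in zip(values, buckets):
    print(v, "->", b)
```

Under these assumptions, 7 to 9 land in bucket 0 and 10 to 13 land in bucket 1.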
The columns are then bucketed with a pipeline as follows:
from pyspark.ml import Pipeline

# stages must be a list of pipeline stages
model = Pipeline(stages=[bagging]).fit(df)
bucketedData = model.transform(df)
How can I add the first block of the code (banned_list, condition, new_df) to the ML pipeline as a stage?