I have a DataFrame of int values, and I'd like to sum each column individually and then test whether that column's sum is above 5. If it is, I'd like to add the column to feature_cols. The answers I've found online only work for pandas, not PySpark. (I'm using Databricks.)
Here is what I have so far:
working_cols = df.columns
    for x in range(0, len(working_cols)):
        if df.agg(sum(working_cols[x])) > 5:
            feature_cols.append(working_cols[x])
The current result is that feature_cols ends up with every column, even though some of them have a sum less than 5.
Out[166]:
['Column_1',
'Column_2',
'Column_3',
'Column_4',
'Column_5',
'Column_6',
'Column_7',
'Column_8',
'Column_9',
'Column_10']
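One thing I'm considering, based on the PySpark docs: df.agg(...) returns a one-row DataFrame rather than a number, so the > 5 comparison never sees the actual sum (and the unqualified sum above may even be Python's built-in rather than pyspark.sql.functions.sum). A sketch of what I think should work, assuming df holds non-null int columns:

    from pyspark.sql import functions as F

    # Compute every column's sum in a single aggregation pass;
    # .first() collects the one-row result as a Row
    sums = df.agg(*[F.sum(F.col(c)).alias(c) for c in df.columns]).first()

    # Keep only the columns whose total exceeds 5
    feature_cols = [c for c in df.columns if sums[c] > 5]

Is this the right approach, or is there a more idiomatic way to do this per-column test in PySpark?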