I have a data frame with int values and I'd like to sum every column individually and then test if that column's sum is above 5. If the column's sum is above 5 then I'd like to add it to feature_cols. The answers I've found online only work for pandas and not PySpark. (I'm using Databricks)

Here is what I have so far:

working_cols = df.columns

for x in range(0, len(working_cols)): 
    if df.agg(sum(working_cols[x])) > 5:
        feature_cols.append(working_cols[x])

The current output for this is that feature_cols has every column, even though some have a sum less than 5.

Out[166]: 
['Column_1',
 'Column_2',
 'Column_3',
 'Column_4',
 'Column_5',
 'Column_6',
 'Column_7',
 'Column_8',
 'Column_9',
 'Column_10']

1 Answer


I am not an expert in Python, but in your loop you are comparing a DataFrame (DataFrame[sum(x): bigint]) with the integer 5, not the sum itself, and that comparison evaluates to True for every column, which is why feature_cols ends up containing everything.

df.agg(sum(working_cols[x])).collect()[0][0] should give you what you want: it collects the aggregated DataFrame to the driver, selects the first row (there is only one) and then the first column (only one as well).
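
For example, the whole loop could look like this (a sketch, assuming feature_cols starts empty and all columns are numeric; note the explicit import so that Spark's sum is used rather than Python's built-in sum):

from pyspark.sql import functions as F

feature_cols = []
for c in df.columns:
    # collect() returns a list of Row objects; [0][0] extracts the single sum value
    if df.agg(F.sum(c)).collect()[0][0] > 5:
        feature_cols.append(c)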

Note that your approach is not optimal in terms of performance. You could compute all the sums in a single pass over the DataFrame like this:

from pyspark.sql import functions as F

# one sum expression per column, all computed in a single job
sums = [F.sum(x).alias(str(x)) for x in df.columns]
d = df.select(sums).collect()[0].asDict()

With this code, you get a dictionary that associates each column name with its sum, and on which you can apply any logic that is of interest to you.
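
Applied to your case, the threshold check then reduces to a list comprehension over that dictionary (a sketch using the threshold of 5 from the question):

feature_cols = [c for c, s in d.items() if s > 5]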
