
I have a DataFrame with 500 columns. For each row, I need to compute the average of the set of columns whose names start with "country_".

    expr = [F.sum(train_data_df[x]) / colCount for x in train_data_df.columns if 'country_' in x]
    avg_train_data_df = train_data_df.withColumn('avg', *expr)

I get the following error response:

    TypeError: withColumn() takes 3 positional arguments but 212 were given

  • Possible duplicate of [Spark DataFrame: Computing row-wise mean (or any aggregate operation)](https://stackoverflow.com/questions/32670958/spark-dataframe-computing-row-wise-mean-or-any-aggregate-operation) – pault Sep 19 '18 at 13:43
  • You should be using `__builtin__.sum` and not `pyspark.sql.functions.sum` - something like `train_data_df.withColumn('avg', sum(F.col(x) for x in df.columns if x.startswith("country_")) / n)` where `n = len([x for x in df.columns if x.startswith("country_")])`. Also note that if you're dealing with integers in python 2, you may have to cast one of the operands in the division to float. – pault Sep 19 '18 at 13:46
  • Thanks @pault. The explanation about the built in sum solved the case – Hariprasath Thiagarajan Sep 19 '18 at 14:04

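For reference, here is a minimal runnable sketch of the approach described in pault's comment: Python's built-in `sum` adds the `Column` objects element-wise, producing a single column expression, whereas `pyspark.sql.functions.sum` is an aggregate over rows. The sample column names and data below are hypothetical.

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical sample data standing in for the 500-column DataFrame
    train_data_df = spark.createDataFrame(
        [(1, 10.0, 20.0, 5.0), (2, 30.0, 40.0, 7.0)],
        ["id", "country_a", "country_b", "other"],
    )

    country_cols = [c for c in train_data_df.columns if c.startswith("country_")]
    n = len(country_cols)

    # Built-in sum combines the columns into one expression, so withColumn
    # receives a single Column instead of a list of 200+ arguments.
    avg_train_data_df = train_data_df.withColumn(
        "avg", sum(F.col(c) for c in country_cols) / n
    )
    avg_train_data_df.show()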