How to iterate over one column to get the mean for each row?

Question

I have a PySpark DataFrame with one column named Numerical_fields. I computed this DataFrame elsewhere. Example:

 Numerical_fields| Age | Height | Weight

Now I need to calculate the mean for each value in this column. I tried looping over the rows with for i in df.collect(): but how can I get the mean?

`data_collect = df2.collect()` then `for f in df2.collect(): print(f.mean)` — Palkin Jangra, Apr 13 '22 at 08:47
Please edit your question and put the code there, correctly formatted, and more of your code than just that one line please — Matthias, Apr 13 '22 at 09:42
Please see [How to make good reproducible Apache Spark examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples) and update your question accordingly. — desertnaut, Apr 13 '22 at 13:25

score 0 · Answer 1 · answered Apr 13 '22 at 14:09

To get a df with the mean of each value in Numerical_fields you can do the following:

avg_df = df.groupby(df.Numerical_fields).avg("Age", "Height", "Weight")

avg_df will now contain one row per unique value in Numerical_fields, with the averages of the other columns for that value. The result columns are named avg(Age), avg(Height) and avg(Weight).
