
I have created a data frame like below:

from pyspark.sql import Row
l = [('Ankit','25','Ankit','Ankit'),('Jalfaizy','2.2','Jalfaizy',"aa"),('saurabh','230','saurabh',"bb"),('Bala','26',"aa","bb")]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], ages=x[1], lname=x[2], mname=x[3]))
schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.show()

+----+--------+-----+--------+
|ages|   lname|mname|    name|
+----+--------+-----+--------+
|  25|   Ankit|Ankit|   Ankit|
| 2.2|Jalfaizy|   aa|Jalfaizy|
| 230| saurabh|   bb| saurabh|
|  26|      aa|   bb|    Bala|
+----+--------+-----+--------+

I want to find the average length of each column, i.e. the total number of characters in a particular column divided by the number of rows. Below is my expected output:

+----+--------+-----+--------+
|ages|   lname|mname|    name|
+----+--------+-----+--------+
| 2.5|     5.5| 2.75|       6|
+----+--------+-----+--------+
Sai

3 Answers


This is actually pretty straightforward. We will use a projection for the column length and an aggregation for the avg:

from pyspark.sql.functions import length, col, avg

selection = ['lname','mname','name']

schemaPeople \
    .select(*(length(col(c)).alias(c) for c in selection)) \
    .agg(*(avg(col(c)).alias(c) for c in selection)).show()

# +-----+-----+----+
# |lname|mname|name|
# +-----+-----+----+
# |  5.5| 2.75| 6.0|
# +-----+-----+----+

This way, you'll be able to pass the names of the columns dynamically. What we are doing here is actually unpacking the argument list (selection).

Reference: Control Flow Tools - Unpacking Argument Lists.
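If you need all columns, including ages, a minimal variation (a sketch, assuming all columns are string-typed as in the question) is to derive the selection from the DataFrame itself:

from pyspark.sql.functions import avg, col, length

selection = schemaPeople.columns  # ['ages', 'lname', 'mname', 'name']

schemaPeople \
    .select(*(length(col(c)).alias(c) for c in selection)) \
    .agg(*(avg(col(c)).alias(c) for c in selection)) \
    .show()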

pault
eliasah

I think you can just create new columns for the individual lengths and then group the dataframe. You would end up with something like:

df_new = spark.createDataFrame([
    ("25", "Ankit", "Ankit", "Ankit"), ("2.2", "Jalfaizy", "aa", "Jalfaizy"),
    ("230", "saurabh", "bb", "saurabh"), ("26", "aa", "bb", "Bala")
], ("age", "lname", "mname", "name"))

from pyspark.sql.functions import avg, col, length

df_new.withColumn("len_age", length(col("age"))).withColumn("len_lname", length(col("lname")))\
    .withColumn("len_mname", length(col("mname"))).withColumn("len_name", length(col("name")))\
    .groupBy().agg(avg("len_age"), avg("len_lname"), avg("len_mname"), avg("len_name")).show()

Result:

+------------+--------------+--------------+-------------+
|avg(len_age)|avg(len_lname)|avg(len_mname)|avg(len_name)|
+------------+--------------+--------------+-------------+
|         2.5|           5.5|          2.75|          6.0|
+------------+--------------+--------------+-------------+
gaw
  • Hi, thanks for the quick reply. Can you please suggest how to pass the column names dynamically? In my original problem I have many columns; here I have mentioned only some of them. If possible, please explain with a UDF. – Sai Oct 29 '18 at 08:26
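To address the comment: one way to keep this withColumn approach while passing the column names dynamically, without a UDF (a sketch using functools.reduce; it assumes all columns are strings):

from functools import reduce
from pyspark.sql.functions import avg, col, length

cols = df_new.columns  # works for any number of columns

# Fold withColumn over the column list, adding one length column per input column.
df_len = reduce(lambda df, c: df.withColumn("len_" + c, length(col(c))), cols, df_new)
df_len.groupBy().agg(*(avg("len_" + c) for c in cols)).show()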

In Scala it can be done this way; I guess it can be converted to Python by the author:

import org.apache.spark.sql.functions.{avg, col, length}

val averageColumnList = List("age", "lname", "mname", "name")
val columns = averageColumnList.map(name => avg(length(col(name))))
val result = df.select(columns: _*)
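A possible Python equivalent (a sketch; column names follow the question's schemaPeople, where the age column is called ages):

from pyspark.sql.functions import avg, col, length

average_column_list = ["ages", "lname", "mname", "name"]
columns = [avg(length(col(name))) for name in average_column_list]
schemaPeople.select(*columns).show()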
pasha701