
I have created a data frame like below:

from pyspark.sql import Row
l = [('Ankit','25','Ankit','Ankit'),('Jalfaizy','2.2','Jalfaizy',"aa"),('saurabh','230','saurabh',"bb"),('Bala','26',"aa","bb")]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], ages=x[1], lname=x[2], mname=x[3]))
schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.show()

+----+--------+-----+--------+
|ages|   lname|mname|    name|
+----+--------+-----+--------+
|  25|   Ankit|Ankit|   Ankit|
| 2.2|Jalfaizy|   aa|Jalfaizy|
| 230| saurabh|   bb| saurabh|
|  26|      aa|   bb|    Bala|
+----+--------+-----+--------+

I want to find the average length of each column, i.e. the total number of characters in a particular column divided by the number of rows. Below is my expected output:

+----+--------+-----+--------+
|ages|   lname|mname|    name|
+----+--------+-----+--------+
| 2.5|     5.5| 2.75|       6|
+----+--------+-----+--------+
Sai

3 Answers


This is actually pretty straightforward. We will use a projection for the column length and an aggregation for the avg:

from pyspark.sql.functions import length, col, avg

selection = ['lname','mname','name']

schemaPeople \
    .select(*(length(col(c)).alias(c) for c in selection)) \
    .agg(*(avg(col(c)).alias(c) for c in selection)).show()

# +-----+-----+----+
# |lname|mname|name|
# +-----+-----+----+
# |  5.5| 2.75| 6.0|
# +-----+-----+----+

This way, you'll be able to pass the names of the columns dynamically. What we are doing here is actually unpacking the argument list (selection).

Reference: Control Flow Tools - Unpacking Argument Lists.
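If you need all columns, including ages, a minimal variation (a sketch, assuming all columns are string-typed as in the question) is to derive the selection from the DataFrame itself:

from pyspark.sql.functions import avg, col, length

selection = schemaPeople.columns  # ['ages', 'lname', 'mname', 'name']

schemaPeople \
    .select(*(length(col(c)).alias(c) for c in selection)) \
    .agg(*(avg(col(c)).alias(c) for c in selection)) \
    .show()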

pault
eliasah

I think you can just create new columns for the individual lengths and then group the dataframe. You would end up with something like:

df_new = spark.createDataFrame([
    ("25", "Ankit", "Ankit", "Ankit"), ("2.2", "Jalfaizy", "aa", "Jalfaizy"),
    ("230", "saurabh", "bb", "saurabh"), ("26", "aa", "bb", "Bala")
], ("age", "lname", "mname", "name"))

from pyspark.sql.functions import avg, col, length

df_new.withColumn("len_age", length(col("age"))).withColumn("len_lname", length(col("lname")))\
    .withColumn("len_mname", length(col("mname"))).withColumn("len_name", length(col("name")))\
    .groupBy().agg(avg("len_age"), avg("len_lname"), avg("len_mname"), avg("len_name")).show()

Result:

+------------+--------------+--------------+-------------+
|avg(len_age)|avg(len_lname)|avg(len_mname)|avg(len_name)|
+------------+--------------+--------------+-------------+
|         2.5|           5.5|          2.75|          6.0|
+------------+--------------+--------------+-------------+
gaw
  • Hi, thanks for the quick reply. Can you please suggest how to pass the column names dynamically? In my original problem I have many columns; here I have mentioned only some of them. If possible, please explain with a UDF. – Sai Oct 29 '18 at 08:26
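To address the comment: one way to keep this withColumn approach while passing the column names dynamically, without a UDF (a sketch using functools.reduce; it assumes all columns are strings):

from functools import reduce
from pyspark.sql.functions import avg, col, length

cols = df_new.columns  # works for any number of columns

# Fold withColumn over the column list, adding one length column per input column.
df_len = reduce(lambda df, c: df.withColumn("len_" + c, length(col(c))), cols, df_new)
df_len.groupBy().agg(*(avg("len_" + c) for c in cols)).show()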

In Scala it can be done this way; I guess it can be converted to Python by the author:

import org.apache.spark.sql.functions.{avg, col, length}

val averageColumnList = List("age", "lname", "mname", "name")
val columns = averageColumnList.map(name => avg(length(col(name))))
val result = df.select(columns: _*)
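A possible Python equivalent (a sketch; column names follow the question's schemaPeople, where the age column is called ages):

from pyspark.sql.functions import avg, col, length

average_column_list = ["ages", "lname", "mname", "name"]
columns = [avg(length(col(name))) for name in average_column_list]
schemaPeople.select(*columns).show()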
pasha701