
I want to get the maximum length of each column in a PySpark dataframe.

Following is the sample dataframe:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data2 = [("James", "", "Smith", "36636", "M", 3000),
    ("Michael", "Rose", "", "40288", "M", 4000),
    ("Robert", "", "Williams", "42114", "M", 4000),
    ("Maria", "Anne", "Jones", "39192", "F", 4000),
    ("Jen", "Mary", "Brown", "", "F", -1)
]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
])

df = spark.createDataFrame(data=data2, schema=schema)
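
For reference, df.show() renders this sample roughly as follows:

df.show()

# +---------+----------+--------+-----+------+------+
# |firstname|middlename|lastname|   id|gender|salary|
# +---------+----------+--------+-----+------+------+
# |    James|          |   Smith|36636|     M|  3000|
# |  Michael|      Rose|        |40288|     M|  4000|
# |   Robert|          |Williams|42114|     M|  4000|
# |    Maria|      Anne|   Jones|39192|     F|  4000|
# |      Jen|      Mary|   Brown|     |     F|    -1|
# +---------+----------+--------+-----+------+------+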

I tried to adapt a solution that was provided in Scala, but I could not convert it to PySpark.


1 Answer


This would work: length() computes the string length of each value, and max() aggregates it per column.

from pyspark.sql.functions import col, length, max

# maximum string length of every column, computed in a single aggregation
df = df.select([max(length(col(name))) for name in df.schema.names])

Result:
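
For the sample data above, this should give the maxima below (the integer salary column is implicitly cast to a string by length() under the default settings, and the result columns come back with auto-generated names like max(length(firstname))):

df.show()
# firstname → 7, middlename → 4, lastname → 8, id → 5, gender → 1, salary → 4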

Edit: for reference, converting the result to rows (as asked in pyspark max string length for each column in the dataframe; updated there as well):

from pyspark.sql import Row

df = df.select([max(length(col(name))).alias(name) for name in df.schema.names])
row = df.first().asDict()
# one Row(col, length) per original column
df2 = spark.createDataFrame([Row(col=name, length=row[name]) for name in df.schema.names], ['col', 'length'])

Output:
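
Calling df2.show() should then print one (col, length) row per original column, roughly:

df2.show()

# +----------+------+
# |       col|length|
# +----------+------+
# | firstname|     7|
# |middlename|     4|
# |  lastname|     8|
# |        id|     5|
# |    gender|     1|
# |    salary|     4|
# +----------+------+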

Ronak Jain