Create multiple rows of fixed length from a data frame column in Pyspark

Question

My input is a dataframe column in pyspark and it has only one column DETAIL_REC.

detail_df.show()

DETAIL_REC
================================
ABC12345678ABC98765543ABC98762345

detail_df.printSchema()
root
|-- DETAIL_REC: string(nullable =true)

For every 11th char/string it has to be in next row of dataframe for downstream process to consume this.

Expected output Should be multiple rows in dataframe

DETAIL_REC (No spaces lines after each record)
==============
ABC12345678
ABC98765543 
ABC98762345

anky · Accepted Answer · 2022-03-04T13:57:19.423

If you have spark 2.4+ version, we can make use of higher order functions to do it like below:

from pyspark.sql import functions as F
n = 11
output = df.withColumn("SubstrCol",F.explode((F.expr(f"""filter(
                                      transform(
                                      sequence(0,length(DETAIL_REC),{n})
                                      ,x-> substring(DETAIL_REC,x+1,{n}))
                                      ,y->y <> '')"""))))

output.show(truncate=False)

+---------------------------------+-----------+
|DETAIL_REC                       |SubstrCol  |
+---------------------------------+-----------+
|ABC12345678ABC98765543ABC98762345|ABC12345678|
|ABC12345678ABC98765543ABC98762345|ABC98765543|
|ABC12345678ABC98765543ABC98762345|ABC98762345|
+---------------------------------+-----------+

Logic used:

First generate a sequence of integers starting from 0 to length of the string in steps of 11 (n)
Using transform iterate through this sequence and keep getting substrings from the original string (This keeps changing the start position.
Filter out any blank strings from the resulting array and explode this array.

For lower versions of spark, use a udf with textwrap or any other functions as addressed here:

from pyspark.sql import functions as F, types as T
from textwrap import wrap
n = 11
myudf = F.udf(lambda x: wrap(x,n),T.ArrayType(T.StringType()))

output = df.withColumn("SubstrCol",F.explode(myudf("DETAIL_REC")))

output.show(truncate=False)

+---------------------------------+-----------+
|DETAIL_REC                       |SubstrCol  |
+---------------------------------+-----------+
|ABC12345678ABC98765543ABC98762345|ABC12345678|
|ABC12345678ABC98765543ABC98762345|ABC98765543|
|ABC12345678ABC98765543ABC98762345|ABC98762345|
+---------------------------------+-----------+

@Kumar I added another solution for lower versions of spark. Please try the same and let me know. — anky, Mar 04 '22 at 13:57

Create multiple rows of fixed length from a data frame column in Pyspark

1 Answers1