1

My input is a dataframe column in pyspark and it has only one column DETAIL_REC.

detail_df.show()

DETAIL_REC
================================
ABC12345678ABC98765543ABC98762345

detail_df.printSchema()
root
|-- DETAIL_REC: string(nullable =true)

For every 11th char/string it has to be in next row of dataframe for downstream process to consume this.

Expected output Should be multiple rows in dataframe

DETAIL_REC (No spaces lines after each record)
==============
ABC12345678
ABC98765543 
ABC98762345 
pltc
  • 5,836
  • 1
  • 13
  • 31
Kumar
  • 37
  • 6

1 Answers1

0

If you have spark 2.4+ version, we can make use of higher order functions to do it like below:

from pyspark.sql import functions as F
n = 11
output = df.withColumn("SubstrCol",F.explode((F.expr(f"""filter(
                                      transform(
                                      sequence(0,length(DETAIL_REC),{n})
                                      ,x-> substring(DETAIL_REC,x+1,{n}))
                                      ,y->y <> '')"""))))

output.show(truncate=False)

+---------------------------------+-----------+
|DETAIL_REC                       |SubstrCol  |
+---------------------------------+-----------+
|ABC12345678ABC98765543ABC98762345|ABC12345678|
|ABC12345678ABC98765543ABC98762345|ABC98765543|
|ABC12345678ABC98765543ABC98762345|ABC98762345|
+---------------------------------+-----------+

Logic used:

  1. First generate a sequence of integers starting from 0 to length of the string in steps of 11 (n)
  2. Using transform iterate through this sequence and keep getting substrings from the original string (This keeps changing the start position.
  3. Filter out any blank strings from the resulting array and explode this array.

For lower versions of spark, use a udf with textwrap or any other functions as addressed here:

from pyspark.sql import functions as F, types as T
from textwrap import wrap
n = 11
myudf = F.udf(lambda x: wrap(x,n),T.ArrayType(T.StringType()))

output = df.withColumn("SubstrCol",F.explode(myudf("DETAIL_REC")))

output.show(truncate=False)

+---------------------------------+-----------+
|DETAIL_REC                       |SubstrCol  |
+---------------------------------+-----------+
|ABC12345678ABC98765543ABC98762345|ABC12345678|
|ABC12345678ABC98765543ABC98762345|ABC98765543|
|ABC12345678ABC98765543ABC98762345|ABC98762345|
+---------------------------------+-----------+
anky
  • 74,114
  • 11
  • 41
  • 70