Working with fixed-width (flat) files with multiple rows ended by a new line in pyspark

Question

I am fairly new to pyspark and still trying to find my way around it. Kindly help, I am trying to extract data from a flat file using pyspark and assign it column names using the length and position. Then later I have to convert it to parquet format on azure synapse. I have been struggling with getting it right.

I have used the suggestions provided here https://stackoverflow.com/a/74074532/16562593 to come up with the following code:

df = spark.read.text("PMNTINSTR_RES_24052023163601_00010.txt")
col_idx = {"MDRecordType": {"Position": 1, "Length":2},\
    "Channel": {"Position": 3, "Length":10},\
    "MessageType": {"Position": 13, "Length":20},\
    "ISOCountryCode": {"Position": 33, "Length":2},\
    "SWIFTBIC": {"Position": 35, "Length":11},\
    "InterchangeId": {"Position": 46, "Length":31},\
    "ProcessingIndicator": {"Position": 77, "Length":10},\
    "TimeStamp": {"Position": 87, "Length":20},\
    "INRecordType": {"Position": 1, "Length":2},\
    "InstructionId": {"Position": 3, "Length":31},\
    "InterchangeId": {"Position": 34, "Length":31},\
    "Source": {"Position": 75, "Length":10},\
    "SenderBIC": {"Position": 85, "Length":11}}
df.select(*map(lambda x: trim(df.value.substr(col_idx[x]['Position'], col_idx[x]   ['Length'])).alias(x), col_idx)).show()

and fixed width file looks like this (I have truncated some of the rows for practicality)

MDDUMMY      FILEFEEDBACK             ZAXXXXXXXX   49752700                                 20230524163505                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
IN49752700                       DUMMY      XXXXXXXX   20230524163505      1002                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
IS49752700                      49752700                       20230524163505      20230524T                                                                      0000100001

The Output is

+------------+--------+------------+--------------+---------+--------------------+-------------------+--------------+------------+--------------------+------+------+
|MDRecordType| Channel| MessageType|ISOCountryCode| SWIFTBIC|       InterchangeId|ProcessingIndicator|     TimeStamp|INRecordType|       InstructionId|Status|Source|
+------------+--------+------------+--------------+---------+--------------------+-------------------+--------------+------------+--------------------+------+------+
|          MD|   DUMMY|FILEFEEDBACK|            ZA| XXXXXXXX|XXXXXXXXX   49752700|                   |20230524163505|          MD|DUMMY     FILEFEE...|      |   202|
|          IN|49752700|            |             C|PARSDUMMY|          CPARSDUMMY|                   |              |          IN|            49752700|      |      |

Desired is

+------------+--------+------------+--------------+---------+--------------------+-------------------+--------------+------------+--------------------+------+---------+
|MDRecordType| Channel| MessageType|ISOCountryCode| SWIFTBIC|       InterchangeId|ProcessingIndicator|     TimeStamp|INRecordType|       InstructionId|Source|SenderBIC|
+------------+--------+------------+--------------+---------+--------------------+-------------------+--------------+------------+--------------------+------+---------+
|          MD|   DUMMY|FILEFEEDBACK|            ZA| XXXXXXXX|            49752700|                   |20230524163505|          IN|            49752700|DUMMY | XXXXXXXX|

Please assist how can I go about it to read all the records correctly to get to the desired output

Working with fixed-width (flat) files with multiple rows ended by a new line in pyspark

0 Answers0