I am fairly new to pyspark and still trying to find my way around it. Kindly help, I am trying to extract data from a flat file using pyspark and assign it column names using the length and position. Then later I have to convert it to parquet format on azure synapse. I have been struggling with getting it right.
I have used the suggestions provided here https://stackoverflow.com/a/74074532/16562593 to come up with the following code:
df = spark.read.text("PMNTINSTR_RES_24052023163601_00010.txt")
col_idx = {"MDRecordType": {"Position": 1, "Length":2},\
"Channel": {"Position": 3, "Length":10},\
"MessageType": {"Position": 13, "Length":20},\
"ISOCountryCode": {"Position": 33, "Length":2},\
"SWIFTBIC": {"Position": 35, "Length":11},\
"InterchangeId": {"Position": 46, "Length":31},\
"ProcessingIndicator": {"Position": 77, "Length":10},\
"TimeStamp": {"Position": 87, "Length":20},\
"INRecordType": {"Position": 1, "Length":2},\
"InstructionId": {"Position": 3, "Length":31},\
"InterchangeId": {"Position": 34, "Length":31},\
"Source": {"Position": 75, "Length":10},\
"SenderBIC": {"Position": 85, "Length":11}}
df.select(*map(lambda x: trim(df.value.substr(col_idx[x]['Position'], col_idx[x] ['Length'])).alias(x), col_idx)).show()
and fixed width file looks like this (I have truncated some of the rows for practicality)
MDDUMMY FILEFEEDBACK ZAXXXXXXXX 49752700 20230524163505
IN49752700 DUMMY XXXXXXXX 20230524163505 1002
IS49752700 49752700 20230524163505 20230524T 0000100001
The Output is
+------------+--------+------------+--------------+---------+--------------------+-------------------+--------------+------------+--------------------+------+------+
|MDRecordType| Channel| MessageType|ISOCountryCode| SWIFTBIC| InterchangeId|ProcessingIndicator| TimeStamp|INRecordType| InstructionId|Status|Source|
+------------+--------+------------+--------------+---------+--------------------+-------------------+--------------+------------+--------------------+------+------+
| MD| DUMMY|FILEFEEDBACK| ZA| XXXXXXXX|XXXXXXXXX 49752700| |20230524163505| MD|DUMMY FILEFEE...| | 202|
| IN|49752700| | C|PARSDUMMY| CPARSDUMMY| | | IN| 49752700| | |
Desired is
+------------+--------+------------+--------------+---------+--------------------+-------------------+--------------+------------+--------------------+------+---------+
|MDRecordType| Channel| MessageType|ISOCountryCode| SWIFTBIC| InterchangeId|ProcessingIndicator| TimeStamp|INRecordType| InstructionId|Source|SenderBIC|
+------------+--------+------------+--------------+---------+--------------------+-------------------+--------------+------------+--------------------+------+---------+
| MD| DUMMY|FILEFEEDBACK| ZA| XXXXXXXX| 49752700| |20230524163505| IN| 49752700|DUMMY | XXXXXXXX|
Please assist how can I go about it to read all the records correctly to get to the desired output