
I am trying to convert a big 34 GB fixed-width file into a structured format using PySpark, but my job is taking too long to complete (almost 10+ hours). The file has very long lines of almost 50K characters each, which I am trying to split using substring into around 5,000 columns and store as a Parquet table. If anyone has faced and resolved a similar issue, any suggestions are greatly appreciated. We have Spark 3.1.1 running through Google's Spark Kubernetes Operator on an OpenShift cluster.
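
For reference, a minimal sketch of this kind of substring-based parse (the column specs, names, and paths below are illustrative, not from the actual job):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("fixed-width-substring").getOrCreate()

# Illustrative (name, start, length) specs; the real file has ~5,000 of these,
# typically generated from a layout/copybook definition.
colspecs = [("col1", 1, 3), ("col2", 4, 8), ("col3", 12, 3), ("col4", 15, 4)]

# Read each ~50K-character line as a single string column named "value".
raw = spark.read.text("/path/to/fixed_width_input.txt")  # illustrative path

# Build all columns in one select so the plan stays a single projection,
# instead of thousands of chained withColumn calls.
parsed = raw.select(
    *[F.substring("value", start, length).alias(name)
      for name, start, length in colspecs]
)

parsed.write.mode("overwrite").parquet("/path/to/output_parquet")  # illustrative path
```

With ~5,000 columns, building everything in one `select` rather than a loop of `withColumn` calls matters: each `withColumn` adds another projection that the analyzer has to re-resolve, which can dominate runtime at this column count.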

Sanjay Bagal
  • Can you provide a snippet of the code you are trying to execute? And maybe an example of the input and expected output? And can you share your cluster size? – Kevin Barranco Mar 04 '23 at 14:24
  • @KevinBarranco Thanks for responding. The scenario is exactly the same as [this one](https://stackoverflow.com/questions/41944689/pyspark-parse-fixed-width-text-file), and the logic works with a substring operation on each line; the concern is that the very long lines and the large number of columns make it run very slow. – Sanjay Bagal Mar 06 '23 at 14:53
  • I see. Using your example, I tried using a regex to create a fake CSV and then split the data: ```df = df.withColumn("value", F.regexp_replace("value", "^([A-z0-9]{3})([A-z0-9]{8})([A-z0-9\s]{3})([A-z0-9]{4})", r"$1,$2,$3,$4"))``` With this, you can insert a comma between each data group and then use `F.split` or the `F.from_csv` function (see the sketch after these comments). It can have a positive impact on performance. – Kevin Barranco Mar 06 '23 at 15:18
  • Thanks @KevinBarranco, I will give it a try with the actual file and report back with the results. Thanks a lot for your support. – Sanjay Bagal Mar 08 '23 at 01:04
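
Following up on the regex-to-CSV idea from the comments, a minimal sketch of how that could look end to end (the pattern, schema, and paths are illustrative; a real run would need a pattern and schema covering all ~5,000 fields, most likely generated programmatically):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("fixed-width-regex-csv").getOrCreate()

# Read each fixed-width line as a single string column named "value".
raw = spark.read.text("/path/to/fixed_width_input.txt")  # illustrative path

# One capture group per field; the widths here are just examples.
pattern = r"^(.{3})(.{8})(.{3})(.{4})"

# Insert commas between the captured groups to produce a CSV-like line.
delimited = raw.withColumn(
    "value", F.regexp_replace("value", pattern, r"$1,$2,$3,$4")
)

# Parse the delimited line with from_csv using a DDL schema string.
schema = "col1 STRING, col2 STRING, col3 STRING, col4 STRING"
parsed = (
    delimited
    .select(F.from_csv("value", schema).alias("row"))
    .select("row.*")
)

parsed.write.mode("overwrite").parquet("/path/to/output_parquet")  # illustrative path
```

Note that this assumes the delimiter (here a comma) never appears in the data, and a replacement string with thousands of group references gets unwieldy, so the single-`select` substring approach above may still be the simpler route at this scale.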

0 Answers