Here are two ways you could do this in PySpark:
Specify the separator and the quote on read
If you're reading the data from a file, you can use spark.read.csv
with the sep and quote arguments:
df = spark.read.csv("path/to/file", sep=":", quote="#", header=True)
df.show()
#+-----+-------+-----+
#| Name|Roll_no|Class|
#+-----+-------+-----+
#|ab:cd| 23| C|
#|sd:ps| 34| A|
#|ra:kh| 14| H|
#|ku:pa| 36| S|
#|ra:sh| 50| P|
#+-----+-------+-----+
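For this to parse as shown, the underlying file would have to look something like the following (the raw file isn't shown in the question, so this is inferred from the output above):

Name:Roll_no:Class
#ab:cd#:23:C
#sd:ps#:34:A
#ra:kh#:14:H
#ku:pa#:36:S
#ra:sh#:50:P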
Use Regular Expressions
If you're unable to change the way the data is read and you're starting with the DataFrame shown in the question, you can use regular expressions to get the desired output.
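As a reference point, the starting DataFrame is assumed to have a single string column whose name holds all three headers, roughly:

df.show(truncate=False)
#+------------------+
#|Name:Roll_no:Class|
#+------------------+
#|#ab:cd#:23:C      |
#|#sd:ps#:34:A      |
#|#ra:kh#:14:H      |
#|#ku:pa#:36:S      |
#|#ra:sh#:50:P      |
#+------------------+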
First, get the new column names by splitting the existing column name on ":".
new_columns = df.columns[0].split(":")
print(new_columns)
#['Name', 'Roll_no', 'Class']
For the Name column, you need to extract the data between the #s. For the other two columns, you need to remove the string between the #s (and the following ":") and use pyspark.sql.functions.split to extract the components:
from pyspark.sql.functions import regexp_extract, regexp_replace, split

# Name: pull out the text between the two "#" quote characters.
# Roll_no and Class: strip the quoted "#...#:" prefix, then split what's left on ":".
df.withColumn(new_columns[0], regexp_extract(df.columns[0], r"(?<=#).+(?=#)", 0))\
    .withColumn(new_columns[1], split(regexp_replace(df.columns[0], "#.+#:", ""), ":")[0])\
    .withColumn(new_columns[2], split(regexp_replace(df.columns[0], "#.+#:", ""), ":")[1])\
    .select(*new_columns)\
    .show()
#+-----+-------+-----+
#| Name|Roll_no|Class|
#+-----+-------+-----+
#|ab:cd| 23| C|
#|sd:ps| 34| A|
#|ra:kh| 14| H|
#|ku:pa| 36| S|
#|ra:sh| 50| P|
#+-----+-------+-----+
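If you ever have more than two trailing fields, the same idea generalizes with a loop over the split components. A minimal sketch under the same assumptions about the #...# quoting (the names old, rest, and out are mine):

from pyspark.sql.functions import regexp_extract, regexp_replace, split

old = df.columns[0]
new_columns = old.split(":")

# Everything after the quoted Name portion, e.g. "23:C", split on ":"
rest = split(regexp_replace(old, "#.+#:", ""), ":")

out = df.withColumn(new_columns[0], regexp_extract(old, r"(?<=#).+(?=#)", 0))
for i, c in enumerate(new_columns[1:]):
    out = out.withColumn(c, rest[i])
out.select(*new_columns).show()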