Given the data frame below, I want to split the numbers column into an array, with every 3 characters of the original number becoming one element of the array.
Given data frame:
+---+------------------+
| id| numbers|
+---+------------------+
|742| 000000000|
|744| 000000|
|746|003000000000000000|
+---+------------------+
Expected dataframe :
+---+----------------------------------+
| id| numbers |
+---+----------------------------------+
|742| [000, 000, 000] |
|744| [000, 000] |
|746| [003, 000, 000, 000, 000, 000] |
+---+----------------------------------+
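Just to make the goal concrete, this is the per-row transformation I am after in plain Python (fixed-width chunking, outside Spark; the sample value is taken from the last row above):

s = '003000000000000000'
[s[i:i + 3] for i in range(0, len(s), 3)]
# ['003', '000', '000', '000', '000', '000']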
I tried different regular expressions with the split function. Below is my attempt with the regex that I felt should have worked on the very first try:
import pyspark.sql.functions as f

df = spark.createDataFrame(
    [
        [742, '000000000'],
        [744, '000000'],
        [746, '003000000000000000'],
    ],
    ["id", "numbers"],
)
# attempt: split on every group of exactly three digits
df = df.withColumn("numbers", f.split("numbers", "[0-9]{3}"))
df.show()
The result, however, is:
+---+--------------+
| id| numbers|
+---+--------------+
|742| [, , , ]|
|744| [, , ]|
|746|[, , , , , , ]|
+---+--------------+
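Wrapping the pattern in a capturing group does not change anything either. As far as I can tell, Spark's split follows Java regex semantics, where captured delimiters are discarded just the same (unlike Python's re.split, which would keep them in the result):

# same empty arrays when run against the original frame
df.withColumn("numbers", f.split("numbers", "([0-9]{3})")).show()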
I want to understand what I am doing wrong. Is there a way to set a global flag so that all matches are returned, or have I missed something in the regular expression altogether?
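For comparison, what I am effectively after is the behaviour of Python's re.findall, which returns every non-overlapping match of the pattern:

import re

re.findall(r'[0-9]{3}', '003000000000000000')
# ['003', '000', '000', '000', '000', '000']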