I want to split a column of a Spark DataFrame into two columns at the first appearance of a character, which in this case is an underscore ("_").
I have prepared a fully reproducible example:
The mock Spark DataFrame is:
df = spark.createDataFrame(
    [
        (1, 1.8, 'newyork_3434_north'),
        (4, 2.6, 'la_432432432_south'),
        (6, 3.3, 'boston_234324_east'),
        (8, 4.1, 'detroit_6757_west'),
        (2, 5.7, 'miami_133123_north'),
        (3, 6.2, 'atlanta_093394_west'),
        (1, 6.1, 'houston_87342_east')
    ],
    ('ranking', 'coordenate', 'city')
)
The above code creates a table like the following one:
ranking  coordenate  city
1        1.8         newyork_3434_north
4        2.6         la_432432432_south
6        3.3         boston_234324_east
8        4.1         detroit_6757_west
2        5.7         miami_133123_north
3        6.2         atlanta_093394_west
1        6.1         houston_87342_east
What I would like to do is split the column city into two columns at the first underscore, scanning from left to right.
The desired final table would then look like:
ranking  coordenate  city     code
1        1.8         newyork  3434_north
4        2.6         la       432432432_south
6        3.3         boston   234324_east
8        4.1         detroit  6757_west
2        5.7         miami    133123_north
3        6.2         atlanta  093394_west
1        6.1         houston  87342_east
I have seen several threads on this topic (link_1, link_2, etc.), but they do not deal with the first appearance of a character; they split on every occurrence of a given character in the string, or at a fixed character position.
I have also tried the Python pandas approach, but as expected it does not carry over to PySpark by extension or analogy (link_3).
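To be precise about the behaviour I am after: in plain Python it is exactly str.split with maxsplit=1, which breaks a string only at the first underscore and leaves the rest intact:

```python
# Plain-Python illustration of the desired split: each string is broken
# only at the FIRST underscore; later underscores are left untouched.
rows = [
    'newyork_3434_north',
    'la_432432432_south',
    'boston_234324_east',
]

for s in rows:
    city, code = s.split('_', 1)  # maxsplit=1 -> split at first '_' only
    print(city, code)
# newyork 3434_north
# la 432432432_south
# boston 234324_east
```

I suspect pyspark.sql.functions.split with its limit argument (available in Spark 3.0+) is the column-level analogue, but I am not sure how to apply it here.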
Thank you in advance for your help.