I have a CSV file with two string columns (term, code). The code column has a special format [num]-[two_letters]-[text], where the text part can itself contain dashes -. I want to read this file with Spark into a dataframe of exactly four columns (term, num, two_letters, text).
Input
+---------------------------------+
| term | code |
+---------------------------------+
| term01 | 12-AB-some text |
| term02 | 130-CD-some-other-text |
+---------------------------------+
Output
+------------------------------------------+
| term | num | letters | text |
+------------------------------------------+
| term01 | 12 | AB | some text |
| term02 | 130 | CD | some-other-text |
+------------------------------------------+
I can split the code column into three columns when its text part contains no dashes, but how can I achieve a solution that handles all cases (something like: put everything after exactly two dashes into one column)?
The code to split a column into three is explained well in the answer here