This is a small example of a pyspark column (String) in my dataframe.
column | new_column
------------------------------------------------------------------------------------------------- |--------------------------------------------------
Hoy es día de ABC/KE98789T983456 clase. | 98789
------------------------------------------------------------------------------------------------- |--------------------------------------------------
Como ABC/KE 34562Z845673 todas las mañanas | 34562
------------------------------------------------------------------------------------------------- |--------------------------------------------------
Hoy tiene ABC/KE 110330/L63868 clase de matemáticas, | 110330
------------------------------------------------------------------------------------------------- |--------------------------------------------------
Marcos se ABC 898456/L56784 levanta con sueño. | 898456
------------------------------------------------------------------------------------------------- |--------------------------------------------------
Marcos se ABC898456 levanta con sueño. | 898456
------------------------------------------------------------------------------------------------- |--------------------------------------------------
comienza ABC - KE 60014 -T60058 | 60014
------------------------------------------------------------------------------------------------- |--------------------------------------------------
inglés y FOR 102658/L61144 ciencia. Se viste, desayuna | 102658
------------------------------------------------------------------------------------------------- |--------------------------------------------------
y comienza FOR ABC- 72981 / KE T79581: el camino hacia la | 72981
------------------------------------------------------------------------------------------------- |--------------------------------------------------
escuela. Se FOR ABC 101665 - 103035 - 101926 - 105484 - 103036 - 103247 - encuentra con su | [101665,103035,101926,105484,103036,103247]
------------------------------------------------------------------------------------------------- |--------------------------------------------------
escuela ABCS 206048/206049/206050/206051/205225-FG-matemáticas- | [206048,206049,206050,206051,205225]
------------------------------------------------------------------------------------------------- |--------------------------------------------------
encuentra ABCS 111553/L00847 & 111558/L00895 - matemáticas | [111553, 111558]
------------------------------------------------------------------------------------------------- |--------------------------------------------------
ciencia ABC 163278/P20447 AND RETROFIT ABCS 164567/P21000 - 164568/P21001 - desayuna | [163278,164567,164568 ]
------------------------------------------------------------------------------------------------- |--------------------------------------------------
ABC/KE 71729/T81672 - 71781/T81674 71782/T81676 71730/T81673 71783/T81677 71784/T | [71729,71781,71782,71730,71783,71784]
------------------------------------------------------------------------------------------------- |--------------------------------------------------
ciencia ABC/KE2646/L61175:E/F-levanta con sueño L61/62LAV AT Z5CTR/XC D3-1593 | [2646]
-----------------------------------------------------------------------------------------------------------------------------------------------------
escuela ABCS 6048/206049/6050/206051/205225-FG-matemáticas- MSN 2345 | [6048,206049,6050,206051,205225]
-----------------------------------------------------------------------------------------------------------------------------------------------------
FOR ABC/KE 109038_L35674_DEFINE AND DESIGN IMPROVEMENTS OF 1618 FROM 118(PDS4 BRACKETS) | [109038]
-----------------------------------------------------------------------------------------------------------------------------------------------------
y comienza FOR ABC- 2981 / KE T79581: el camino hacia la 9856 | [2981]
I want to extract all numbers that contain: 4, 5 or 6 digits from this text. Condition and cases to extract them:
- Attached to ABC/KE (first line in the example above).
- after ABC/KE + space (second and third line).
- after ABC + space (line 4)
- after ABC without space (line 5)
- after ABC - KE + space
- after for word
- after ABC- + space
- after ABC + space
- after ABCS (line 10 and 11)
Example of failed cases:
Column | new_column
------------------------------------------------------------------------------------------------------------------------
FOR ABC/KE 109038_L35674_DEFINE AND DESIGN IMPROVEMENTS OF 1618 FROM 118(PDS4 BRACKETS) | [1618] ==> should be [109038]
------------------------------------------------------------------------------------------------------------------------
ciencia ABC/KE2646/L61175:E/F-levanta con sueño L61/62LAV AT Z5CTR/XC D3-1593 | [1593] ==> should be [2646]
------------------------------------------------------------------------------------------------------------------------
escuela ABCS 6048/206049/6050/206051/205225-FG-matemáticas- MSN 2345 | [6048,206049,6050,206051,205225, 2345] ==> should be [6048,206049,6050,206051,205225]
I hope that I resumed the cases, you can see my example above and the expect output. How can I do it ? Thank you