2

This is a small example of a pyspark column (String) in my dataframe.

 column                                                                                            |   new_column
------------------------------------------------------------------------------------------------- |--------------------------------------------------
Hoy es día de ABC/KE98789T983456 clase.                                                           |    98789
------------------------------------------------------------------------------------------------- |--------------------------------------------------
Como ABC/KE 34562Z845673 todas las mañanas                                                        |    34562
------------------------------------------------------------------------------------------------- |--------------------------------------------------
Hoy tiene ABC/KE 110330/L63868 clase de matemáticas,                                              |    110330
------------------------------------------------------------------------------------------------- |--------------------------------------------------
Marcos se ABC 898456/L56784 levanta con sueño.                                                    |    898456
------------------------------------------------------------------------------------------------- |--------------------------------------------------
Marcos se ABC898456 levanta con sueño.                                                            |    898456
------------------------------------------------------------------------------------------------- |--------------------------------------------------
comienza ABC - KE 60014 -T60058                                                                   |    60014
------------------------------------------------------------------------------------------------- |--------------------------------------------------
inglés y FOR 102658/L61144 ciencia. Se viste, desayuna                                            |    102658
------------------------------------------------------------------------------------------------- |--------------------------------------------------
y comienza FOR ABC- 72981 / KE T79581: el camino hacia la                                         |    72981
------------------------------------------------------------------------------------------------- |--------------------------------------------------
escuela. Se FOR ABC 101665 - 103035 - 101926 - 105484 - 103036 - 103247 - encuentra con su        |    [101665,103035,101926,105484,103036,103247]
------------------------------------------------------------------------------------------------- |--------------------------------------------------
escuela ABCS 206048/206049/206050/206051/205225-FG-matemáticas-                                   |    [206048,206049,206050,206051,205225]
------------------------------------------------------------------------------------------------- |--------------------------------------------------
encuentra ABCS 111553/L00847 & 111558/L00895 - matemáticas                                        |    [111553, 111558]
------------------------------------------------------------------------------------------------- |--------------------------------------------------
ciencia ABC 163278/P20447 AND RETROFIT ABCS 164567/P21000 - 164568/P21001 - desayuna              |    [163278,164567,164568 ]
------------------------------------------------------------------------------------------------- |--------------------------------------------------
ABC/KE 71729/T81672 - 71781/T81674 71782/T81676 71730/T81673 71783/T81677 71784/T                 |    [71729,71781,71782,71730,71783,71784]
------------------------------------------------------------------------------------------------- |--------------------------------------------------
ciencia ABC/KE2646/L61175:E/F-levanta con sueño L61/62LAV AT Z5CTR/XC D3-1593                     |    [2646]
-----------------------------------------------------------------------------------------------------------------------------------------------------
escuela ABCS 6048/206049/6050/206051/205225-FG-matemáticas- MSN 2345                              |    [6048,206049,6050,206051,205225]
-----------------------------------------------------------------------------------------------------------------------------------------------------
FOR ABC/KE 109038_L35674_DEFINE AND DESIGN IMPROVEMENTS OF 1618 FROM 118(PDS4 BRACKETS)           |   [109038]
-----------------------------------------------------------------------------------------------------------------------------------------------------
y comienza FOR ABC- 2981 / KE T79581: el camino hacia la 9856                                     |   [2981]

I want to extract all numbers that contain: 4, 5 or 6 digits from this text. Condition and cases to extract them:

- Attached to ABC/KE (first line in the example above).
- after ABC/KE + space (second and third line).
- after ABC + space (line 4)
- after ABC without space (line 5)
- after ABC - KE + space
- after for word
- after ABC- + space
- after ABC + space
- after ABCS (line 10 and 11)

Example of failed cases:

                                    Column                                                    |   new_column
------------------------------------------------------------------------------------------------------------------------
FOR ABC/KE 109038_L35674_DEFINE AND DESIGN IMPROVEMENTS OF 1618 FROM 118(PDS4 BRACKETS)           |   [1618]   ==> should be [109038]
------------------------------------------------------------------------------------------------------------------------
ciencia ABC/KE2646/L61175:E/F-levanta con sueño L61/62LAV AT Z5CTR/XC D3-1593                     |   [1593]  ==> should be [2646]
------------------------------------------------------------------------------------------------------------------------
escuela ABCS 6048/206049/6050/206051/205225-FG-matemáticas- MSN 2345                              |    [6048,206049,6050,206051,205225, 2345]  ==> should be [6048,206049,6050,206051,205225]

I hope that I resumed the cases, you can see my example above and the expect output. How can I do it ? Thank you

verojoucla
  • 599
  • 2
  • 12
  • 23

1 Answers1

3

One way using regexes to clean out the data and set up a lone anchor with value of ABC to identify the start of a potential match. after str.split(), iterate through the resulting array to flag and retrieve consecutive matching numbers that follow this anchor.

Edit: Added underscore _ into the data pattern (\b(\d{4,6})(?=[A-Z/_]|$)) so that it now allows underscore as an anchor to follow the matched substring of 4-6 digit. this fixed the first line, line 2 and 3 should be working with the existing regex patterns.

import re
from pyspark.sql.types import ArrayType, StringType
from pyspark.sql.functions import udf

(1) Use regex patterns to clean out the raw data so that we have only one anchor ABC to identify the start of a potential match:

  • clean1: use [-&\s]+ to convert '&', '-' and whitespaces to a SPACE ' ', they are used to connect a chain of numbers

    example: `ABC - KE`  -->  `ABC KE`
             `103035 - 101926 - 105484` -> `103035 101926 105484`
             `111553/L00847 & 111558/L00895` -> `111553/L00847 111558/L00895`
    
  • clean2: convert text matching the following three sub-patterns into 'ABC '

    + ABCS?(?:[/\s]+KE|(?=\s*\d))
      + ABC followed by an optional `S`
        + followed by at least one slash or whitespace and then `KE` --> `[/\s]+KE`
          example: `ABC/KE 110330/L63868` to `ABC  110330/L63868`
        + or followed by optional whitespaces and then at least one digit  --> (?=\s*\d)
          example: ABC898456 -> `ABC 898456`
    + \bFOR\s+(?:[A-Z]+\s+)*
      + `FOR` words
          example: `FOR DEF HJK 12345` -> `ABC 12345`
    
  • data: \b(\d{4,6})(?=[A-Z/_]|$) is a regex to match actual numbers: 4-6 digits followed by [A-Z/] or end_of_string

(2) Create a dict to save all 3 patterns:

ptns = {
    'clean1': re.compile(r'[-&\s]+', re.UNICODE)
  , 'clean2': re.compile(r'\bABCS?(?:[/\s-]+KE|(?=\s*\d))|\bFOR\s+(?:[A-Z]+\s+)*', re.UNICODE)
  , 'data'  : re.compile(r'\b(\d{4,6})(?=[A-Z/_]|$)', re.UNICODE)
}

(3) Create a function to find matched numbers and save them into an array

def find_number(s_t_r, ptns, is_debug=0):
  try:
    arr = re.sub(ptns['clean2'], 'ABC ', re.sub(ptns['clean1'], ' ', s_t_r.upper())).split()
    if is_debug: return arr
    # f: flag to identify if a chain of matches is started, default is 0(false)
    f = 0
    new_arr = []
    # iterate through the above arr and start checking numbers when anchor is detected and set f=1
    for x in arr:
      if x == 'ABC':
        f = 1
      elif f:
        new = re.findall(ptns['data'], x)
        # if find any matches, else reset the flag
        if new:
          new_arr.extend(new)
        else:
          f = 0
    return new_arr
  except Exception as e:
    # only use print in local debugging
    print('ERROR:{}:\n  [{}]\n'.format(s_t_r, e))
    return []

(4) defind the udf function

udf_find_number = udf(lambda x: find_number(x, ptns), ArrayType(StringType()))

(5) get the new_column

df.withColumn('new_column', udf_find_number('column')).show(truncate=False)
+------------------------------------------------------------------------------------------+------------------------------------------------+
|column                                                                                    |new_column                                      |
+------------------------------------------------------------------------------------------+------------------------------------------------+
|Hoy es da de ABC/KE98789T983456 clase.                                                    |[98789]                                         |
|Como ABC/KE 34562Z845673 todas las ma?anas                                                |[34562]                                         |
|Hoy tiene ABC/KE 110330/L63868 clase de matem篓垄ticas,                                    |[110330]                                        |
|Marcos se ABC 898456/L56784 levanta con sue?o.                                            |[898456]                                        |
|Marcos se ABC898456 levanta con sue?o.                                                    |[898456]                                        |
|comienza ABC - KE 60014 -T60058                                                           |[60014]                                         |
|ingl篓娄s y FOR 102658/L61144 ciencia. Se viste, desayuna                                  |[102658]                                        |
|y comienza FOR ABC- 72981 / KE T79581: el camino hacia la                                 |[72981]                                         |
|escuela. Se FOR ABC 101665 - 103035 - 101926 - 105484 - 103036 - 103247 - encuentra con su|[101665, 103035, 101926, 105484, 103036, 103247]|
|escuela ABCS 206048/206049/206050/206051/205225-FG-matem篓垄ticas-                         |[206048, 206049, 206050, 206051, 205225]        |
|encuentra ABCS 111553/L00847 & 111558/L00895 - matem篓垄ticas                              |[111553, 111558]                                |
|ciencia ABC 163278/P20447 AND RETROFIT ABCS 164567/P21000 - 164568/P21001 - desayuna      |[163278, 164567, 164568]                        |
|ABC/KE 71729/T81672 - 71781/T81674 71782/T81676 71730/T81673 71783/T81677 71784/T         |[71729, 71781, 71782, 71730, 71783, 71784]      |
+------------------------------------------------------------------------------------------+------------------------------------------------+

(6) code for debugging, use find_number(row.column, ptns, 1) to check how/if the first two regex patterns work as expected:

for row in df.limit(10).collect():
  print('{}:\n    {}\n'.format(row.column, find_number(row.column, ptns)))

Some Notes:

  • in clean2 pattern, ABCS and ABS are treated the same way. if they are different, just remove the 'S' and add a new alternative ABCS\s*(?=\d) to the end of the pattern

    re.compile(r'\bABC(?:[/\s-]+KE|(?=\s*\d))|\bFOR\s+(?:[A-Z]+\s+)*|ABCS\s*(?=\d)')
    
  • current pattern clean1 only treats '-', '&' and whitespaces as consecutive connector, you might add more characters or words like 'and', 'or', for example:

    re.compile(r'[-&\s]+|\b(?:AND|OR)\b')
    
  • FOR words is \bFOR\s+(?:[A-Z]+\s+)*, this might be adjusted based on if numbers are allowed in words etc.

  • This was tested on Python-3. using Python-2, there might be issue with unicode, you can fix it by using the method in the first answer of reference

jxc
  • 13,553
  • 4
  • 16
  • 34
  • Hi @jxc, It's working for 5 and 6 digits thanks, but no for the 4 digits. I got a wrong numbers that contain 4 digits yes, but doesn't respect the pattern conditions like 5 and 6 digits. I'm sorry fir this delay, some help please ? Thank you – verojoucla Nov 06 '19 at 09:45
  • @verojoucla, can yo add all samples that failed, I will need to check what exactly happened. – jxc Nov 06 '19 at 12:20
  • Comment is corrected, sorry :) The idea is always I should respect the existing of the words ABC, ABCS, FOR and KE. I edited the question, I added 3 lines in the end of the example of a dataset. And I added an example of the failed cases . Thank you very much – verojoucla Nov 06 '19 at 13:09
  • 1
    @verojoucla, I tested te 3 new samples, 2nd and 3rd ones should be working. and the 1st one, just need a small adjustment change the pattern for data to: , `'data' : re.compile(r'\b(\d{4,6})(?=[A-Z/_]|$)', re.UNICODE)`. I just added so that the 4-6 numbers must be followed by a uppercase letter, a slash or underscore (underscore was missing from previous one). that should fix the issue. – jxc Nov 06 '19 at 15:41
  • Thanks a lot it's worked well. And I have posted a new question, I want just to add some conditions to this pattern, don't touch to the old cases. So may be you only can help again :) Thank you https://stackoverflow.com/questions/58755555/modify-a-pattern-to-extract-number-from-text – verojoucla Nov 07 '19 at 19:20