
I have a PySpark DataFrame and I would like to improve the regex below. I want to add a condition or modify the regex to:

  • extract every number that has a `/` or a letter attached at its end.

Example of case1:

column_example                                        |   new_column
------------------------------------------------------|-----------------
mundo por el número de NJU/LOK 12345T98789-hablantes  |   12345
hispanohablantes ZES/UJ86758/L87586:residentes en     |   86758

Example case 2:

  • I shouldn't extract a number that comes after the word ABC.

Column Example:

    My_column                                             |         new_column
------------------------------------------------------|---------------------
mundo por el número de ABC 8567 hablantes             |           []
------------------------------------------------------|---------------------
con dominio nativo ABC 987480 millones de personas    |           []
------------------------------------------------------|---------------------
hispanohablantes residentes en ABC98754 otros países  |           []

My current code is:

ptn = re.compile(r'^(?:MOD)?[0-9]{4,6}$')
array_filter = udf(
    lambda arr: [x.lstrip('MOD') for x in arr if re.match(ptn, x)] if type(arr) is list else arr,
    ArrayType(StringType())
)

How can I do this? Thank you.

verojoucla
  • So, what is the expected output? – Wiktor Stribiżew Oct 15 '19 at 14:23
  • @WiktorStribiżew I edited my question – verojoucla Oct 15 '19 at 14:26
  • 1
    Try replacing `[ x.lstrip('MOD') for x in arr if re.match(ptn, x) ]` with `re.findall(r"(?<!\S)(?<!\bABC\s)(?:MOD)?([0-9]{4,6})(?!\S)", " ".join(arr))` – Wiktor Stribiżew Oct 15 '19 at 14:27
  • @verojoucla, are you looking for an array column with multiple matched items in the same text? – jxc Oct 15 '19 at 14:51
  • @jxc yes :) in fact I have many conditions in my pattern, each time I add some condition – verojoucla Oct 15 '19 at 15:14
  • Did you finally use my suggestion? – Wiktor Stribiżew Oct 16 '19 at 16:34
  • @WiktorStribiżew, it's actually fine, but I prefer moving `(?:MOD)?` to the front so that the lookbehind anchors sit right where they should be. As I mentioned to you, there is confusion about the data types on the PySpark side; the task was made overly complex by switching data back and forth between StringType and ArrayType. – jxc Oct 16 '19 at 17:20
  • @WiktorStribiżew, I suggest you formally post your answer here and reopen the new one, since it's not the same. Placing the two lookbehind anchors in front of `(?:MOD)?` might be fine for this task, but it has potential bugs if you check it carefully. – jxc Oct 16 '19 at 17:50
  • The issue is the same. No need to reopen the same issue. You just need to adjust your answer. – Wiktor Stribiżew Oct 16 '19 at 17:59
  • @WiktorStribiżew I used it in another case, can you give me a solution for my question https://stackoverflow.com/questions/58415461/how-filter-number-from-an-array?noredirect=1#comment103178777_58415461 – verojoucla Oct 16 '19 at 19:07

1 Answer


One way, without using a udf, that works on Spark versions before 2.4.0:

from pyspark.sql.functions import split, regexp_replace

df.withColumn('new_column'
   , split(
       regexp_replace(
           regexp_replace('My_column', r'.*?(?<!ABC\s{0,5})(?<!\d)(\d{4,6})(?=[A-Z/])', '$1\0')
         , '\0?[^\0]*$'
         , ''
       )
     ,'\0')
   ) \
  .show(truncate=False)
+-----------------------------------------------------------------------+--------------+
|My_column                                                              |new_column    |
+-----------------------------------------------------------------------+--------------+
|23458/ mundo por el número de NJU/LOK 12345T98789 hablantes            |[23458, 12345]|
|con dominio nativo ABC 987480 millones ZES/UJ86758/L87586:residentes en|[86758]       |
|hispanohablantes  residentes en ABC98754/ otros países                 |[]            |
+-----------------------------------------------------------------------+--------------+

Where:

  • use regexp_replace: to replace the text matching the following pattern

    .*?(?<!ABC\s{0,5})(?<!\d)(\d{4,6})(?=[A-Z/])
    

with `$1\0`, which strips all unrelated text before NUMBER_NEEDED (saved in `$1`). NUMBER_NEEDED must not be preceded by `ABC\s{0,5}` or `\d`, and must be followed by `[A-Z/]`. A NUL char `\0` is appended after each matched `$1`.

  • use another `regexp_replace(text, '\0?[^\0]*$', '')` to remove the trailing unrelated text left after the last NUL; without this cleanup, the last item of the resulting array would be irrelevant

  • use `split(text, '\0')` to convert the cleaned text into an array of the saved numbers
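The replace/replace/split pipeline above can be sketched in plain Python. Note this is only an approximation: Python's `re` module accepts only fixed-width lookbehinds, so `(?<!ABC)` (no intervening spaces) stands in for Spark's `(?<!ABC\s{0,5})` here.

```python
import re

# Plain-Python sketch of the replace -> replace -> split pipeline.
# Approximation: Python's re rejects the variable-width lookbehind
# (?<!ABC\s{0,5}), so (?<!ABC) is used instead.
PAT = r'.*?(?<!ABC)(?<!\d)(\d{4,6})(?=[A-Z/])'

def extract_numbers(text):
    # step 1: keep each matched number and append a NUL separator
    step1 = re.sub(PAT, r'\1' + '\0', text)
    # step 2: drop the trailing unrelated text after the last NUL
    step2 = re.sub('\0?[^\0]*$', '', step1)
    # step 3: split on NUL to obtain the list of numbers
    return step2.split('\0') if step2 else []

print(extract_numbers('mundo por el número de NJU/LOK 12345T98789 hablantes'))
# → ['12345']
```

With the sample rows from the answer, `'con dominio nativo ABC 987480 millones ZES/UJ86758/L87586:residentes en'` yields `['86758']` and `'hispanohablantes residentes en ABC98754 otros países'` yields `[]`, matching the Spark output above.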

Notes:

  • (?<!ABC\s{0,5}) tests for 0-5 whitespace characters between ABC and NUMBER_NEEDED, since a regex negative lookbehind does not support (?<!ABC\s*). If your text might contain more spaces in between, adjust 5 to a larger number. BTW, (?<!ABC\s{0,5}) works in PySpark (Java regex) but is invalid in the Python re module, which allows only fixed-width lookbehind patterns.

  • prepend (?s) to the pattern to enable dotall mode if any texts contain line breaks

  • I assumed the NUL char \0 does not appear in your original texts. Since it won't be part of any match, you can remove all occurrences first (regexp_replace(text, '\0', '')) before running the above 3 functions.
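The fixed-width restriction in the first note can be checked directly: CPython's `re` refuses to compile the variable-width lookbehind, which is why the Spark pattern can't be reused verbatim in a Python udf.

```python
import re

# Python's re module rejects variable-width lookbehinds, so the Spark
# pattern (?<!ABC\s{0,5}) cannot be compiled here.
try:
    re.compile(r'(?<!ABC\s{0,5})(\d{4,6})(?=[A-Z/])')
    print('compiled')
except re.error as exc:
    print('rejected:', exc)
```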

Another way using udf:

import re
from pyspark.sql.types import ArrayType, StringType
from pyspark.sql.functions import udf

ptn = re.compile(r'(?<!ABC)(?<!\d)(\d{4,6})(?=[A-Z/])')

find_number = udf(
    lambda x: re.findall(ptn, re.sub(r'(?<=ABC)\s+', '', x)) if x else [],
    ArrayType(StringType())
)

df.withColumn('new_column', find_number('My_column')).show()
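Since the udf's body is plain Python, the core regex can be sanity-checked without a Spark session (same `ptn` and logic as the udf above):

```python
import re

# Standalone version of the udf's lambda for quick testing.
ptn = re.compile(r'(?<!ABC)(?<!\d)(\d{4,6})(?=[A-Z/])')

def find_number(text):
    if not text:
        return []
    # collapse the gap in 'ABC 12345' into 'ABC12345' so the
    # fixed-width lookbehind (?<!ABC) can reject numbers after ABC
    return ptn.findall(re.sub(r'(?<=ABC)\s+', '', text))

print(find_number('mundo por el número de NJU/LOK 12345T98789 hablantes'))
# → ['12345']
print(find_number('mundo por el número de ABC 8567 hablantes'))
# → []
```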
jxc
  • I shouldn't keep the number in this case. See my question above. – verojoucla Oct 15 '19 at 15:22
  • @verojoucla, can you update your question and add the expected result for the input text? I thought you wanted all numbers right after 'ABC\s*' – jxc Oct 15 '19 at 15:25
  • @verojoucla the resulting column is ArrayType or StringType? – jxc Oct 15 '19 at 15:42
  • it's an arrayType – verojoucla Oct 15 '19 at 15:43
  • some help for my question here :) knowing that I shouldn't affect the old condition, I want to add a new condition. Thanks https://stackoverflow.com/questions/58415461/how-filter-number-from-an-array/58415992#58415992 – verojoucla Oct 16 '19 at 14:44
  • Hi, @verojoucla, the example dataframe shows the column `My_column` as a `StringType` field, but it looks like it is an ArrayType in your task. It would be more helpful if you could use its actual DataType in the example, since they could lead to completely different solutions. – jxc Oct 16 '19 at 15:13
  • My_column is an Array of string – verojoucla Oct 16 '19 at 15:14
  • @verojoucla But the result from `df.show()` is a StringType; an ArrayType column would have brackets enclosing the fields and commas separating the array items. The result of df.show() is kind of misleading. It would be better to print its original format, keeping the commas and brackets, so the audience knows what is in each array item, or whether this is an array with only one item. – jxc Oct 16 '19 at 15:20
  • And it's very likely to parse the whole text from StringType into ArrayType in one step instead of doing the split/join/split things. – jxc Oct 16 '19 at 15:21
  • In fact, from the beginning I've been working on the split text. I split my text, then I started on the regex – verojoucla Oct 16 '19 at 15:22
  • @verojoucla, not sure how you split the text, I guess all can be merged into one udf function. Now, in case of your new question, are those two conditions 'OR' (either one should be a match?) and what is the comma in the text, are they part of text or just used to separate array items?? – jxc Oct 16 '19 at 15:34
  • This is the code how I did the split: df.withColumn('my_column', F.expr('flatten(sentences(column_to_split))')) – verojoucla Oct 16 '19 at 15:48
  • Can you help me please in my question https://stackoverflow.com/questions/58451588/how-extract-numbers-from-text-no-splited?noredirect=1#comment103241222_58451588 – verojoucla Oct 18 '19 at 13:45
  • @verojoucla, your new questions got blocked, I think someone at regex section got mad at you. I suggest you delete the two closed posts and post a new one without regex tag. If you can answer the two questions I asked under that closed question, I will post an answer I have. have a good weekend. – jxc Oct 18 '19 at 21:05
  • this is a new question https://stackoverflow.com/questions/58474791/how-to-extract-number-from-text-pyspark – verojoucla Oct 20 '19 at 16:04