
Suppose I have a DataFrame df in PySpark of the following form:

| id | type | description                                                   |
|----|------|---------------------------------------------------------------|
| 1  | "A"  | "Date: 2018/01/01\nDescr: This is a test des\ncription\n"     |
| 2  | "B"  | "Date: 2018/01/02\nDescr: Another test descr\niption\n"       |
| 3  | "A"  | "Date: 2018/01/03\nWarning: This is a warnin\ng, watch out\n" |

which is of course a dummy set, but will suffice for this example.

I have written a regex with named groups that can be used to extract the relevant information from the description field, something along the lines of:

^(?:(?:Date: (?P<DATE>.+?)\n)|(?:Descr: (?P<DESCR>.+?)\n)|(?:Warning: (?P<WARNING>.+?)\n))+$

Again, this is a dummy regex; the actual one is somewhat more elaborate, but the purpose is to capture three possible groups:

| DATE       | DESCR                        | WARNING                        |
|------------|------------------------------|--------------------------------|
| 2018/01/01 | This is a test des\ncription | None                           |
| 2018/01/02 | Another test descr\niption   | None                           |
| 2018/01/03 | None                         | This is a warnin\ng, watch out |
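
For reference, a quick sanity check of the (corrected) dummy pattern against the first row, as an illustrative snippet only:

import re

patt = r"^(?:(?:Date: (?P<DATE>.+?)\n)|(?:Descr: (?P<DESCR>.+?)\n)|(?:Warning: (?P<WARNING>.+?)\n))+$"
s = "Date: 2018/01/01\nDescr: This is a test des\ncription\n"
# re.DOTALL lets .+? run across the newlines embedded inside a value
print(re.match(patt, s, re.DOTALL).groupdict())
# {'DATE': '2018/01/01', 'DESCR': 'This is a test des\ncription', 'WARNING': None}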

Now I would want to add the columns that are the result of the regex match to the original DataFrame (i.e. combine the two dummy tables in this question into one).

I have tried several ways to accomplish this, but none has led to a full solution yet. One thing I've tried is:

import re

def extract_fields(string):
    patt = <ABOVE_PATTERN>
    result = re.match(patt, string, re.DOTALL).groupdict()
    # Actually, a slight work-around is needed to overcome the None problem when
    #   no match can be made; I'm using pandas' .str.extract for this now
    return result

df.rdd.map(lambda x: extract_fields(x.description))

This will yield the second table, but I see no way to combine this with the original columns from df. I have tried to construct a new Row(), but then I run into problems with the ordering of columns that the Row() constructor needs (and the fact that I cannot hard-code the column names that will be added by the regex groups), resulting in a DataFrame that has its columns all jumbled up. How can I achieve what I want, i.e. one DataFrame with six columns: id, type, description, DATE, DESCR and WARNING?

Remark. Actually, the description field is not just one field, but several columns. Using concat_ws, I have concatenated these columns into a new column description, with the description fields separated by \n, but maybe this can be incorporated in a nicer way.
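
For completeness, a minimal sketch of that concatenation step, with hypothetical source column names:

from pyspark.sql.functions import concat_ws

# date_field and descr_field are hypothetical names for the original columns;
# concat_ws joins them with "\n" and skips null values rather than inserting
# empty entries
df = df.withColumn("description", concat_ws("\n", "date_field", "descr_field"))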

konewka
  • Why don't you try: if re.match() matches the date, store it; then check for a Descr match, and finally for the others. Return those matches and use them. – pvy4917 Nov 02 '18 at 16:40
  • You should be able to use `pyspark.sql.functions.regexp_extract` and avoid the `udf`, but I do not believe that supports named groups. You'll have to extract by index. – pault Nov 02 '18 at 16:59
  • @Prazy Nice idea, but that only works when the structure of the description field (i.e. which "keys" will occur there) is known and fixed. However, in my dataframe there are six or seven of these fields, all with varying values – konewka Nov 02 '18 at 16:59
  • Don't use UDF. There are lots of `regexp_extract()` examples that you can make use of. – pvy4917 Nov 02 '18 at 17:01
  • @konewka take a look at [this post](https://stackoverflow.com/questions/53072138/splitting-a-dictionary-in-a-pyspark-dataframe-into-individual-columns/53072990#53072990) which does something similar to what you want, particularly the second part which generalizes for multiple patterns and checks for existence. – pault Nov 02 '18 at 17:11
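
A minimal sketch of the regexp_extract route suggested by pault above, with one pattern and group index per field; the patterns are illustrative only and assume each value runs until the next capitalized "Key:" line or the end of the string:

from pyspark.sql.functions import regexp_extract

# Illustrative patterns; (?s) makes . match the newlines embedded in a value.
# Note that regexp_extract returns '' (not null) when a pattern does not match.
df2 = (df
    .withColumn("DATE",    regexp_extract("description", r"Date: (.*?)\n", 1))
    .withColumn("DESCR",   regexp_extract("description", r"(?s)Descr: (.*?)\n(?=[A-Z]|$)", 1))
    .withColumn("WARNING", regexp_extract("description", r"(?s)Warning: (.*?)\n(?=[A-Z]|$)", 1)))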

1 Answer


I think you can use Pandas for this case. First I convert df to an RDD to split the description field, then I collect the rows into a Pandas DataFrame, and finally I create a Spark DataFrame from the Pandas one. It works regardless of the number of keys in the description field.

>>> import pandas as pd
>>> import re
>>> 
>>> df.show(truncate=False)
+---+----+-----------------------------------------------------------+
|id |type|description                                                |
+---+----+-----------------------------------------------------------+
|1  |A   |Date: 2018/01/01\nDescr: This is a test des\ncription\n    |
|2  |B   |Date: 2018/01/02\nDescr: Another test desc\niption\n       |
|3  |A   |Date: 2018/01/03\nWarning: This is a warnin\ng, watch out\n|
+---+----+-----------------------------------------------------------+

>>> #convert df to rdd
>>> rdd = df.rdd.map(list)
>>> rdd.first()
[1, 'A', 'Date: 2018/01/01\\nDescr: This is a test des\\ncription\\n']
>>> 
>>> #split description field
>>> rddSplit = rdd.map(lambda x: (x[0],x[1],re.split('\n(?=[A-Z])', x[2].encode().decode('unicode_escape'))))
>>> rddSplit.first()
(1, 'A', ['Date: 2018/01/01', 'Descr: This is a test des\ncription\n'])
>>> 
>>> #create empty Pandas df
>>> df1 = pd.DataFrame()
>>> 
>>> #insert rows
>>> for rdd in rddSplit.collect():
...     a = {i.split(':')[0].strip():i.split(':')[1].strip('\n').replace('\n','\\n').strip() for i in rdd[2]}
...     a['id'] = rdd[0]
...     a['type'] = rdd[1]
...     df2 = pd.DataFrame([a], columns=a.keys())
...     df1 = pd.concat([df1, df2])
... 
>>> df1
         Date                         Descr                         Warning  id type
0  2018/01/01  This is a test des\ncription                             NaN   1    A
0  2018/01/02     Another test desc\niption                             NaN   2    B
0  2018/01/03                           NaN  This is a warnin\ng, watch out   3    A
>>>
>>> #create spark df
>>> df3 = spark.createDataFrame(df1.fillna('')).replace('',None)
>>> df3.show(truncate=False)
+----------+----------------------------+------------------------------+---+----+
|Date      |Descr                       |Warning                       |id |type|
+----------+----------------------------+------------------------------+---+----+
|2018/01/01|This is a test des\ncription|null                          |1  |A   |
|2018/01/02|Another test desc\niption   |null                          |2  |B   |
|2018/01/03|null                        |This is a warnin\ng, watch out|3  |A   |
+----------+----------------------------+------------------------------+---+----+
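
Note that collect() pulls every row to the driver, so this approach is only practical while the data fits in driver memory. A driver-free variant (a sketch, not part of the original answer) is to do the parsing in a udf that returns a map and then pull out the expected keys, which does require knowing the key names up front:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import MapType, StringType

@udf(returnType=MapType(StringType(), StringType()))
def parse_description(s):
    import re
    # Split on newlines that start a new "Key: value" entry, then build a dict.
    # If the column holds literal backslash-n sequences (as in the transcript
    # above), decode with s.encode().decode('unicode_escape') first.
    parts = re.split(r'\n(?=[A-Z])', s)
    return {p.split(':', 1)[0].strip(): p.split(':', 1)[1].strip()
            for p in parts if ':' in p}

keys = ['Date', 'Descr', 'Warning']   # assumed to be known up front
df4 = df.withColumn('m', parse_description('description')) \
        .select('id', 'type', 'description', *[col('m')[k].alias(k) for k in keys])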
Ali Yesilli