Suppose I have a DataFrame df
in pySpark of the following form:
| id | type | description |
| 1 | "A" | "Date: 2018/01/01\nDescr: This is a test des\ncription\n |
| 2 | "B" | "Date: 2018/01/02\nDescr: Another test descr\niption\n |
| 3 | "A" | "Date: 2018/01/03\nWarning: This is a warnin\ng, watch out\n |
which is of course a dummy set, but will suffice for this example.
I have made a regex-statement with named groups that can be used to extract the relevant information from the description-field, something along the lines of:
^(?:(?:Date: (?P<DATE>.+?)\n)|(?:Descr: (?P<DESCR>.+?)\n)|(?:Warning: (?P<WARNING>.+?)\n)+$
again, dummy regex, the actual regex is somewhat more elaborate, but the purpose is to capture three possible groups:
| DATE | DESCR | WARNING |
| 2018/01/01 | This is a test des\ncription | None |
| 2018/01/02 | Another test descr\niption | None |
| 2018/01/03 | None | This is a warnin\ng, watch out |
Now I would want to add the columns that are the result of the regex match to the original DataFrame (i.e. combine the two dummy tables in this question into one).
I have tried several ways to accomplish this, but none have lead to the full solution yet. A thing I've tried is:
def extract_fields(string):
patt = <ABOVE_PATTERN>
result = re.match(patt, string, re.DOTALL).groupdict()
# Actually, a slight work-around is needed to overcome the None problem when
# no match can be made, I'm using pandas' .str.extract for this now
return result
df.rdd.map(lambda x: extract_fields(x.description))
This will yield the second table, but I see no way to combine this with the original columns from df
. I have tried to construct a new Row()
, but then I run into problems with the ordering of columns (and the fact that I cannot hard-code the column names that will be added by the regex groups) that is needed in the Row()
-constructor, resulting in a dataframe that is has the columns all jumbled up. How can I achieve what I want, i.e. one DataFrame with six columns: id
, type
, description
, DATE
, DESCR
and WARNING
?
Remark. Actually, the description field is not just one field, but several columns. Using concat_ws
, I have concatenated these columns into a new columns description
with the description-fields separated with \n
, but maybe this can be incorporated in a nicer way.