pyspark Regexp_Extract - Extract multiple words from a string column

Question

I am trying to extract words from a string column using PySpark's regexp_extract.

My DataFrame is below:

ID, Code

10, A1005*B1003

12, A1007*D1008*C1004

result = df.withColumn('Code1', regexp_extract(col('Code'), r'\w+', 0))

Output :

ID, Code,              Code1, 

10, A1005*B1003,       A1005

12, A1007*D1008*C1004, A1007

I want to extract all the codes from the Code column, and I want my DataFrame to display as below.

ID, Code,              Code1,  Code2,  Code3

10, A1005*B1003,       A1005,  B1003,  null

12, A1007*D1008*C1004, A1007,  D1008,  C1004

Possible duplicate of [Split Spark Dataframe string column into multiple columns](https://stackoverflow.com/questions/39235704/split-spark-dataframe-string-column-into-multiple-columns) — pault, Jan 03 '19 at 15:44

Psidom · Accepted Answer · 2019-01-03T15:44:57.217

Assuming your ID column is unique for each row, here is one way of doing it with split, posexplode and then pivot:

import pyspark.sql.functions as f

(df.select('ID', 'Code', f.posexplode(f.split('Code', '\\*')))
   .withColumn('pos', f.concat(f.lit('code'), f.col('pos')))
   .groupBy('ID', 'Code').pivot('pos').agg(f.first('col'))
   .show())
+---+-----------------+-----+-----+-----+
| ID|             Code|code0|code1|code2|
+---+-----------------+-----+-----+-----+
| 10|      A1005*B1003|A1005|B1003| null|
| 12|A1007*D1008*C1004|A1007|D1008|C1004|
+---+-----------------+-----+-----+-----+

Another option without pivoting:

df1 = df.select('ID', 'Code', f.split('Code', '\\*').alias('Codes'))
maxCodes = df1.agg(f.max(f.size('Codes'))).first()[0]      # 3
df1.select(
  'ID', 'Code', 
  *[f.col('Codes').getItem(i).alias(f'Code{i+1}') for i in range(maxCodes)]
).show()
+---+-----------------+-----+-----+-----+
| ID|             Code|Code1|Code2|Code3|
+---+-----------------+-----+-----+-----+
| 10|      A1005*B1003|A1005|B1003| null|
| 12|A1007*D1008*C1004|A1007|D1008|C1004|
+---+-----------------+-----+-----+-----+
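
To see why the shorter row ends up with null in the last column, the same logic can be modelled in plain Python without a Spark session. This is only an illustrative sketch, not Spark code: `split_codes` and `to_wide` are hypothetical helper names, and `None` stands in for the null shown in the output above.

```python
import re

rows = [(10, "A1005*B1003"), (12, "A1007*D1008*C1004")]

def split_codes(code):
    # The asterisk is a regex metacharacter, so it is escaped,
    # as in f.split('Code', '\\*') above.
    return re.split(r"\*", code)

def to_wide(rows):
    parts = [(id_, code, split_codes(code)) for id_, code in rows]
    # The widest row decides how many CodeN columns exist (maxCodes above).
    max_codes = max(len(p) for _, _, p in parts)
    # Shorter rows are padded with None, standing in for the null cells.
    return [(id_, code, *(p + [None] * (max_codes - len(p))))
            for id_, code, p in parts]

wide = to_wide(rows)
for row in wide:
    print(row)
```

The point of computing the maximum first is that every row must have the same number of columns, so the column count has to be known before the select is built.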

edited Jan 03 '19 at 15:44

answered Jan 03 '19 at 15:34

Psidom

Hi, thank you for the quick reply. The Code column also holds arithmetic operators: it can store values like (A1002*B1002)-C1003+D1005 or A1004/(C1008-D1006), and the number of codes in a string can go up to 7. – Mayan Jan 03 '19 at 15:49
If the words you want to extract contain only digits and letters, you can replace `f.split(...)` in the two options above with `f.array_remove(f.split('Code', '\\W+'), '')`, and it should give the result you need. – Psidom Jan 03 '19 at 16:07
Hi, could you please help me transpose the same dataset as below?
ID, Code, Code_T
10, A1005*B1003, A1005
10, A1005*B1003, B1003
12, A1007*D1008*C1004, A1007
12, A1007*D1008*C1004, D1008
12, A1007*D1008*C1004, C1004
– Mayan Jan 08 '19 at 14:35
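
For the operator-containing strings mentioned in the comments, the `\W+` pattern can be checked locally in plain Python before running it in Spark. This is a rough sketch, not Spark code: `extract_codes` and `to_long` are hypothetical helpers, Python's `re.split` stands in for `f.split`, and the list filter stands in for `f.array_remove(..., '')`. The two regex engines can differ in edge cases, so treat it as a sanity check only. The `to_long` helper also models the one-row-per-code layout asked about in the last comment.

```python
import re

def extract_codes(code):
    # Split on runs of non-word characters (operators, parentheses).
    tokens = re.split(r"\W+", code)
    # Drop the empty strings left by leading or trailing separators.
    return [t for t in tokens if t != ""]

def to_long(rows):
    # One output row per extracted code: (ID, Code, Code_T).
    return [(id_, code, t) for id_, code in rows for t in extract_codes(code)]

print(extract_codes("(A1002*B1002)-C1003+D1005"))
print(extract_codes("A1004/(C1008-D1006)"))
for row in to_long([(10, "A1005*B1003"), (12, "A1007*D1008*C1004")]):
    print(row)
```

Without the filter, the first example would start with an empty string (from the leading parenthesis) and the second would end with one, which is why the comment wraps the split in `f.array_remove`. In Spark itself, the long layout would usually be produced by applying `f.explode` to the array column.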
