
I am really new to PySpark, so here is a really basic question: I have a DataFrame which looks like this:

|I  27-May-18 10:1...|false|
|I  27-May-18 10:1...|false|
|I  27-May-18 10:1...|false|
|I  27-May-18 10:1...|false|
|I  27-May-18 10:1...|false|
|W  27-May-18 10:1...|false|
|                 ...|false| ##this one should not be flagged
|W  27-May-18 10:1...|false|

And I want to join each row onto the previous one if it does not start with W, I, E, or U, so it should look like this afterwards:

|I  27-May-18 10:1...|false|
|I  27-May-18 10:1...|false|    
|I  27-May-18 10:1...|false|    
|I  27-May-18 10:1...|false|    
|I  27-May-18 10:1...|false|    
|W  27-May-18 10:1......|false| ##the row after this one was joined to the one before    
|W  27-May-18 10:1...|false|

For that I thought that I would flag the rows, somehow assign groups to the rows, and then use a groupBy statement.

However, I am already stuck at flagging the rows, because the regular expression does not work:

So the regular expression for that would be: `^[EUWI]\s`

When I use it in PySpark, it returns false for every row...

Here is the code:

df_with_x5 = a_7_df.withColumn("x5", a_7_df.line.startswith("[EUWI]\s"))
##I am using startswith, that's why I can drop the `^`

Why does it not accept my regular expression?

    It does not work because [`.startswith`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=regex#pyspark.sql.Column.startswith) does not accept a regex. – Wiktor Stribiżew Jul 31 '18 at 08:13
  • Thanks... is `rlike` a good alternative? With that it works, at least. – Mimi Müller Jul 31 '18 at 08:15
  • Yes, `rlike` accepts a regex. It also allows partial matches. – Wiktor Stribiżew Jul 31 '18 at 08:17 (see the sketch after these comments)
  • Do you know how to tag each true row and its following false rows with a unique number now? – Mimi Müller Jul 31 '18 at 08:19
  • @MimiMüller please read [how to create good reproducible apache spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples) and try to explain in more detail what your desired output is and what the logic is to achieve it. – pault Jul 31 '18 at 14:29
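
As the comments note, `.startswith` performs a literal prefix test, while `rlike` evaluates a real regular expression. Here is a minimal, self-contained sketch of the flag column with `rlike`; the SparkSession setup, the sample rows, and the column name `line` are assumptions modelled on the question:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

##toy rows modelled on the question's data; the column name `line` is assumed
df = spark.createDataFrame(
    [("I  27-May-18 10:15:01 some message",),
     ("W  27-May-18 10:15:02 some warning",),
     ("      continuation of the warning",)],
    ["line"],
)

##rlike accepts a regex, so the anchored character class works as intended
df_with_x5 = df.withColumn("x5", F.col("line").rlike(r"^[EUWI]\s"))
df_with_x5.show(truncate=False)

Only the third row gets false here, because it does not start with one of `E`, `U`, `W`, `I` followed by whitespace.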

1 Answer


If you want to create a flag column, you can try `substring`:

import pyspark.sql.functions as F

df = df.withColumn('flag', F.substring(df.columnName, 1, 1).isin(['W', 'I', 'E', 'U']))

It checks the first letter only.

But you can skip creating a new column and filter the rows directly:

df = df.filter(~F.substring(df.columnName, 1, 1).isin(['W', 'I', 'E', 'U']))
Ala Tarighati
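
To then give each flagged row and its following continuation rows a common group id, as asked in the comments, a common approach is a running sum of the flag over a window. This is a sketch continuing from the `df_with_x5` example above; the ordering column `row_id` is hypothetical and stands for anything that preserves the original line order:

from pyspark.sql import Window
import pyspark.sql.functions as F

##`row_id` is hypothetical: any column that keeps the file's line order,
##e.g. added while reading via F.monotonically_increasing_id()
w = Window.orderBy("row_id")

##the running sum increases by one at every row that starts a new entry,
##so a flagged row and its continuation rows share the same group number
df_grouped = df_with_x5.withColumn("group", F.sum(F.col("x5").cast("int")).over(w))

##join each group back into a single line
df_joined = (
    df_grouped.groupBy("group")
    .agg(F.concat_ws(" ", F.collect_list("line")).alias("line"))
)

Note that a window without `partitionBy` pulls all rows into a single partition, and `collect_list` gives no ordering guarantee after a shuffle, so for large logs this sketch would need a proper partitioning strategy.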