I am really new to pyspark, so here is a really basic question. I have a DataFrame that looks like this:
|I 27-May-18 10:1...|false|
|I 27-May-18 10:1...|false|
|I 27-May-18 10:1...|false|
|I 27-May-18 10:1...|false|
|I 27-May-18 10:1...|false|
|W 27-May-18 10:1...|false|
| ...|false| ##this one should not be flagged
|W 27-May-18 10:1...|false|
I want to join every following row onto the previous one whenever it does not start with W, I, E, or U, so afterwards it should look like this:
|I 27-May-18 10:1...|false|
|I 27-May-18 10:1...|false|
|I 27-May-18 10:1...|false|
|I 27-May-18 10:1...|false|
|I 27-May-18 10:1...|false|
|W 27-May-18 10:1......|false| ##the row after this one was joined to the one before
|W 27-May-18 10:1...|false|
My idea is to flag the rows, somehow assign group ids to them, and then use a groupBy statement.
However, I am already stuck at flagging the rows, because my regular expression does not seem to work:
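The flag-and-group plan can be sketched in plain Python first (the sample lines and the helper name are made up for illustration; in pyspark the running-merge step would typically become a running sum of the flag over an ordered window, followed by a groupBy):

```python
import re

# A record starts with E, U, W or I followed by a whitespace character.
START = re.compile(r"^[EUWI]\s")

def join_continuations(lines):
    """Merge each non-flagged line into the preceding flagged line."""
    records = []
    for line in lines:
        if START.match(line) or not records:
            records.append(line)               # starts a new record
        else:
            records[-1] += " " + line.strip()  # continuation row
    return records

print(join_continuations([
    "I 27-May-18 first",
    "W 27-May-18 second",
    "...continued",
    "W 27-May-18 third",
]))
# → ['I 27-May-18 first', 'W 27-May-18 second ...continued', 'W 27-May-18 third']
```
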
The regular expression for that would be: `'^[EUWI]\s'`
When I use it in pyspark, it returns false for every row...
Here is the code:
df_with_x5 = a_7_df.withColumn("x5", a_7_df.line.startswith("[EUWI]\s"))
## I am using startswith, that's why I can drop the `^`
Why does it not take my regular expression?
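For what it's worth, the pattern itself seems valid; checking it with plain Python's `re` module against a few sample lines (made up here) matches exactly the rows that start a record, so the problem looks like it is in how the pattern is passed to Spark rather than in the pattern itself:

```python
import re

# Same pattern as above: a record starts with E, U, W or I
# followed by a whitespace character.
pattern = re.compile(r"^[EUWI]\s")

lines = [
    "I 27-May-18 10:15 something",
    "W 27-May-18 10:16 something",
    "continuation of the row above",
]

flags = [bool(pattern.match(line)) for line in lines]
print(flags)  # → [True, True, False]
```
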