
I am really new to PySpark, so here is a really basic question: I have a DataFrame which looks like this:

|I  27-May-18 10:1...|false|
|I  27-May-18 10:1...|false|
|I  27-May-18 10:1...|false|
|I  27-May-18 10:1...|false|
|I  27-May-18 10:1...|false|
|W  27-May-18 10:1...|false|
|                 ...|false| ##this one should not be flagged
|W  27-May-18 10:1...|false|

And I want to join each row onto the previous one if it does not start with W, I, E, or U, so it should look like this afterwards:

|I  27-May-18 10:1...|false|
|I  27-May-18 10:1...|false|    
|I  27-May-18 10:1...|false|    
|I  27-May-18 10:1...|false|    
|I  27-May-18 10:1...|false|    
|W  27-May-18 10:1......|false| ##the row after this one was joined to the one before    
|W  27-May-18 10:1...|false|

For that I thought that I would flag the rows, somehow assign groups to the rows, and then use a groupBy statement.

However, I am already stuck at flagging the rows, because the regular expression does not work:

So the regular expression for that would be: `^[EUWI]\s`

When I use it in PySpark, it returns false for every row...

Here is the code:

df_with_x5 = a_7_df.withColumn("x5", a_7_df.line.startswith("[EUWI]\s"))
##I am using startswith, that's why I can drop the `^`

Why does it not accept my regular expression?

    It does not work because [`.startswith`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=regex#pyspark.sql.Column.startswith) does not accept a regex. – Wiktor Stribiżew Jul 31 '18 at 08:13
  • Thanks... is `rlike` a good alternative? With that it works, at least. – Mimi Müller Jul 31 '18 at 08:15
  • Yes, `rlike` accepts a regex. It also allows partial matches. – Wiktor Stribiżew Jul 31 '18 at 08:17 (see the sketch after these comments)
  • Do you know how to tag each true row and its following false rows with a unique number now? – Mimi Müller Jul 31 '18 at 08:19
  • @MimiMüller please read [how to create good reproducible apache spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples) and try to explain in more detail what your desired output is and what the logic is to achieve it. – pault Jul 31 '18 at 14:29
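
As the comments note, `.startswith` performs a literal prefix test, while `rlike` evaluates a real regular expression. Here is a minimal, self-contained sketch of the flag column with `rlike`; the SparkSession setup, the sample rows, and the column name `line` are assumptions modelled on the question:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

##toy rows modelled on the question's data; the column name `line` is assumed
df = spark.createDataFrame(
    [("I  27-May-18 10:15:01 some message",),
     ("W  27-May-18 10:15:02 some warning",),
     ("      continuation of the warning",)],
    ["line"],
)

##rlike accepts a regex, so the anchored character class works as intended
df_with_x5 = df.withColumn("x5", F.col("line").rlike(r"^[EUWI]\s"))
df_with_x5.show(truncate=False)

Only the third row gets false here, because it does not start with one of `E`, `U`, `W`, `I` followed by whitespace.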

1 Answer


If you want to create a flag column, you can try `substring`:

import pyspark.sql.functions as F

df = df.withColumn('flag', F.substring(df.columnName, 1, 1).isin(['W', 'I', 'E', 'U']))

It checks the first letter only.

But you can skip creating a new column and filter the rows directly:

df = df.filter(~F.substring(df.columnName, 1, 1).isin(['W', 'I', 'E', 'U']))
Ala Tarighati
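
To then give each flagged row and its following continuation rows a common group id, as asked in the comments, a common approach is a running sum of the flag over a window. This is a sketch continuing from the `df_with_x5` example above; the ordering column `row_id` is hypothetical and stands for anything that preserves the original line order:

from pyspark.sql import Window
import pyspark.sql.functions as F

##`row_id` is hypothetical: any column that keeps the file's line order,
##e.g. added while reading via F.monotonically_increasing_id()
w = Window.orderBy("row_id")

##the running sum increases by one at every row that starts a new entry,
##so a flagged row and its continuation rows share the same group number
df_grouped = df_with_x5.withColumn("group", F.sum(F.col("x5").cast("int")).over(w))

##join each group back into a single line
df_joined = (
    df_grouped.groupBy("group")
    .agg(F.concat_ws(" ", F.collect_list("line")).alias("line"))
)

Note that a window without `partitionBy` pulls all rows into a single partition, and `collect_list` gives no ordering guarantee after a shuffle, so for large logs this sketch would need a proper partitioning strategy.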