I want to read only the lines that start with a match for a specific regular expression.
val rawData = spark.read.textFile(file.path).filter(f => f.nonEmpty && f.length > 1 && f.startsWith("("))
is what I have so far.
I have now found out that some entries start with:
(W);27536- or (W) 28325- (5 digits after the separator)
I only want to read lines that start with (W);1234- or (W) 1234- (4 digits after the separator)
The regular expression that would catch this looks like: \(\D\)(;|\s)\d{4}-
for a boolean return (the trailing - keeps the 5-digit entries from matching on their first 4 digits), or \(\D\)(;|\s)\d{4}-.*
for a string match return
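To make the intended behaviour concrete, here is a minimal sketch in plain Scala (no Spark) that runs both patterns against a few lines. The sample lines and the names `samples`, `prefixPattern`, `fullLinePattern`, `prefixHits` and `fullLineHits` are made up for illustration, and the prefix pattern is assumed to end with the `-` separator so that five-digit entries are rejected.

```scala
import scala.util.matching.Regex

// Hypothetical sample lines, modelled on the entries described above.
val samples: Seq[String] = Seq(
  "(W);1234-foo",  // 4 digits after ';'   -> wanted
  "(W) 5678-bar",  // 4 digits after space -> wanted
  "(W);27536-baz", // 5 digits             -> not wanted
  "(W) 28325-qux"  // 5 digits             -> not wanted
)

// Prefix form: only the start of the line has to match.
val prefixPattern: Regex = raw"\(\D\)(;|\s)\d{4}-".r

// Full-line form: the whole line has to match, hence the trailing .*
val fullLinePattern: String = raw"\(\D\)(;|\s)\d{4}-.*"

// findPrefixOf yields the matched prefix as Option[String] (None if no match).
val prefixHits: Seq[Option[String]] = samples.map(l => prefixPattern.findPrefixOf(l))

// String.matches treats its String argument as a regex and yields a Boolean.
val fullLineHits: Seq[Boolean] = samples.map(l => l.matches(fullLinePattern))
```

With these samples, only the first two lines should produce a hit in either form.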
My problem now is that I don't know how to include the regular expression in my read.textFile command.
f.startsWith only works with strings
f.matches also only works with strings
I also tried using http://www.scala-lang.org/api/2.12.3/scala/util/matching/Regex.html, but this returns a string and not a boolean, so I cannot use it in the filter function.
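The shape I am after is a `String => Boolean` function that can be handed to `filter`. Below is a minimal sketch of that shape, assuming `Regex.findPrefixOf` (which returns an `Option[String]`) combined with `isDefined` is an acceptable way to get a boolean. The names `entryPattern` and `isWantedLine` are hypothetical, and the Spark call is shown only as a comment because it needs a running `SparkSession` and the real `file.path`.

```scala
import scala.util.matching.Regex

// Hypothetical name; the trailing '-' rejects entries with five digits.
val entryPattern: Regex = raw"\(\D\)(;|\s)\d{4}-".r

// Option[String] becomes a Boolean via isDefined, which is what filter expects.
def isWantedLine(line: String): Boolean =
  entryPattern.findPrefixOf(line).isDefined

// Intended use with the original read (not executed in this sketch):
// val rawData = spark.read.textFile(file.path).filter(f => isWantedLine(f))
```

Because an empty line has no matching prefix, the predicate would also make the separate `nonEmpty` and `length` checks unnecessary.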
Any help would be appreciated.