0

I want to read only the lines whose beginning matches a specific regular expression.

 val rawData = spark.read.textFile(file.path).filter(f => f.nonEmpty && f.length > 1 && f.startsWith("(")) 

This is what I have so far.

I have now found out that some entries start with (W);27536- or (W) 28325- (5 digits after the separator).
I only want to read lines that start with (W);1234- (4 digits after the separator).

The regular expression that would catch this looks like: \(\D\)(;|\s)\d{4} for a boolean test, or \(\D\)(;|\s)\d{4}-.* to match the whole line.

My problem now is that I don't know how to include the regular expression in my read.textFile command.
f.startsWith only works with strings.
f.matches also only works with strings.

I also tried using http://www.scala-lang.org/api/2.12.3/scala/util/matching/Regex.html but this returns a string and not a boolean, so I cannot use it in the filter function.

Any help would be appreciated.

user2811630

3 Answers

2

Other answers are over-thinking this. Just use matches:

val lineRegex = """\(\D\)(;|\s)\d{4}-.*"""
val ns = List ("(W);1234-something",
               "(W);12345-something",
               "(W);2345-something",
               "(W);23456-something",
               "(W);3456-something",
               "",
               "1" )
ns.filter(f=> f.matches(lineRegex))

results in

List("(W);1234-something", "(W);2345-something", "(W);3456-something")
The Archetypal Paul
  • I might add that, depending on how performance-sensitive this step is in the overall process, one might prefer sharing a compiled regular expression (calling `compiledPattern.matcher(string).matches()`) over `string.matches(pattern)`. In even more intensive scenarios, one could also reuse matcher objects, not only patterns, with proper thread-safety measures (e.g. reuse a matcher inside a Spark `rdd.mapPartitions` call): https://stackoverflow.com/questions/11391337/java-pattern-matcher-create-new-or-reset – GPI Sep 27 '17 at 12:16
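
The suggestion in the comment above could be sketched roughly as follows. This is only an illustration in plain Scala without Spark: the object name `PrecompiledFilter` and the helper `keepMatching` are made up for the example, and the pattern is the one from the question.

```scala
import java.util.regex.Pattern

object PrecompiledFilter {
  // Compiled once and shared; Pattern instances are immutable and thread-safe.
  val linePattern: Pattern = Pattern.compile("""\(\D\)(;|\s)\d{4}-.*""")

  // Hypothetical helper: reuses one Matcher for a whole batch of lines.
  // A Matcher is NOT thread-safe, so it is created per call and never shared.
  def keepMatching(lines: Iterator[String]): Iterator[String] = {
    val matcher = linePattern.matcher("")
    // reset(line) points the existing matcher at the new input;
    // matches() requires the whole line to match the pattern.
    lines.filter(line => matcher.reset(line).matches())
  }

  def main(args: Array[String]): Unit = {
    val sample = List("(W);1234-something", "(W);12345-something", "(W) 2345-x", "", "1")
    println(keepMatching(sample.iterator).toList)
  }
}
```

Because the helper takes and returns an `Iterator[String]`, it has the shape that a per-partition call such as `mapPartitions` works with, so each partition would get its own matcher. Whether this is worth the extra code over a plain `matches` depends on how hot this step really is.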
1

I found the answer to my question.

The command needs to look like this.

 val lineregex = """\(\D\)(;|\s)\d{4}-.*""".r

 val rawData = spark.read.textFile(file.path)
  .filter(f => f.nonEmpty && f.length > 1 && lineregex.unapplySeq(f).isDefined )
user2811630
  • No, you don't. You can just use `f.matches(lineregex)` (the `nonEmpty` and `length` tests are redundant since, if they fail, so will the regexp match). – The Archetypal Paul Sep 27 '17 at 09:16
  • Yeah, the code is redundant now, but why should it not work correctly? – user2811630 Sep 27 '17 at 10:39
  • Didn't say it wouldn't work. I was commenting on the "needs to look like this". No, it doesn't - `matches` is simpler and clearer. – The Archetypal Paul Sep 27 '17 at 10:41
  • Okay, thanks. I am now using matches. I thought it did not work because I always put a .r at the end of lineregex, but that is not necessary. I am now running it with your matches command. Thanks. – user2811630 Sep 27 '17 at 10:44
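
To make the `.r` confusion from these comments concrete, here is a small sketch of the variants discussed in this thread. The object and method names are invented for the example. `String.matches` takes the pattern as a plain `String`, while a `Regex` created with `.r` has to be used through its own methods or its underlying `java.util.regex.Pattern`.

```scala
import scala.util.matching.Regex

object MatchVariants {
  val patternString: String = """\(\D\)(;|\s)\d{4}-.*"""
  val lineRegex: Regex = patternString.r

  // String.matches expects the pattern as a String, so no .r here.
  def viaString(line: String): Boolean = line.matches(patternString)

  // With a Regex, one option is to go through its underlying compiled Pattern.
  def viaPattern(line: String): Boolean = lineRegex.pattern.matcher(line).matches()

  // unapplySeq returns Some(groups) only when the whole line matches.
  def viaUnapplySeq(line: String): Boolean = lineRegex.unapplySeq(line).isDefined

  def main(args: Array[String]): Unit = {
    val sample = List("(W);1234-something", "(W);12345-something", "", "1")
    sample.foreach { line =>
      println(s"'$line': ${viaString(line)} ${viaPattern(line)} ${viaUnapplySeq(line)}")
    }
  }
}
```

All three variants require the entire line to match, so they should agree on every input; the difference is only in which type holds the pattern.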
0

You can try to find a match of the Regex using the findFirstMatchIn method, which returns an Option[Match]:

spark.read.textFile(file.path).filter { line =>
  line.nonEmpty &&
  line.length > 1 &&
  "regex".r.findFirstMatchIn(line).isDefined
}
Miguel
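
One thing to keep in mind with `findFirstMatchIn` is that it searches anywhere in the line, not only at the start. The sketch below (names are made up for the example, and the pattern is the prefix part of the one from the question) contrasts it with `findPrefixOf`, which only succeeds when the match begins at the first character. The regex is also defined once outside the lambda, so it is not recompiled for every line.

```scala
object AnchoringDemo {
  // Only the prefix is needed here; the search methods do not require a trailing ".*".
  val prefix = """\(\D\)(;|\s)\d{4}-""".r

  // findFirstMatchIn looks for a match at any position in the line.
  def matchesAnywhere(line: String): Boolean = prefix.findFirstMatchIn(line).isDefined

  // findPrefixOf only reports a match that starts at index 0.
  def matchesAtStart(line: String): Boolean = prefix.findPrefixOf(line).isDefined

  def main(args: Array[String]): Unit = {
    val sample = List("(W);1234-x", "note (W);1234-x", "(W);12345-x")
    sample.foreach { line =>
      println(s"'$line': anywhere=${matchesAnywhere(line)} atStart=${matchesAtStart(line)}")
    }
  }
}
```

For the question as asked (lines that start with the pattern), the anchored variant or a leading `^` in the pattern seems the closer fit.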