Read only lines that start with specific regular expression

Question

I want to read only lines that start with a match for a specific regular expression.

 val rawData = spark.read.textFile(file.path).filter(f => f.nonEmpty && f.length > 1 && f.startsWith("("))

That is what I have so far.

I have now found that some entries start with (W);27536- or (W) 28325- (5 digits after the separator).
I only want to read lines that start with (W);1234- (4 digits after the separator).

The regular expression that would catch this looks like: \(\D\)(;|\s)\d{4} for a boolean return or \(\D\)(;|\s)\d{4}-.* for a string match return

My problem now is that I don't know how to include the regular expression in my read.textFile command.
f.startsWith only works with strings
f.matches also only works with strings

I also tried using http://www.scala-lang.org/api/2.12.3/scala/util/matching/Regex.html but this returns a string and not a boolean, which I cannot use in the filter function.

Any help would be appreciated.

`f.startswith only works with strings;f.matches also only works with strings` Why is this a problem? In your filter, `f` is a string. — The Archetypal Paul, Sep 27 '17 at 09:04
Because the filter function wants a Boolean return and both return a string — user2811630, Sep 27 '17 at 09:10

score 2 · Accepted Answer · answered Sep 27 '17 at 09:21

Other answers are over-thinking this. Just use matches

val lineRegex = """\(\D\)(;|\s)\d{4}-.*"""
val ns = List ("(W);1234-something",
               "(W);12345-something",
               "(W);2345-something",
               "(W);23456-something",
               "(W);3456-something",
               "",
               "1" )
ns.filter(f=> f.matches(lineRegex))

results in

List("(W);1234-something", "(W);2345-something", "(W);3456-something")
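
As a rough sketch of how the same check behaves on both separators from the question: the sample lines and the name `isWanted` below are made up for illustration. `String.matches` returns a Boolean and requires the whole line to match, so the predicate can be handed to `filter` directly. The `filter` call in the question takes the same kind of `String => Boolean` function, so the predicate should carry over there unchanged.

```scala
// Same pattern as in the answer above.
val lineRegex = """\(\D\)(;|\s)\d{4}-.*"""

// Hypothetical sample lines covering both separators from the question.
val sampleLines = List(
  "(W);1234-kept",
  "(W) 1234-kept",
  "(W);27536-dropped",
  "(W) 28325-dropped",
  "x(W);1234-dropped",
  ""
)

// matches returns a Boolean and only succeeds if the entire line matches.
val isWanted: String => Boolean = line => line.matches(lineRegex)

val kept = sampleLines.filter(isWanted)
println(kept)
```

Only the two four-digit lines that begin with the code survive; the five-digit lines fail because the pattern demands a `-` immediately after the fourth digit.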

The Archetypal Paul


I might add that, depending on the performance sensitivity of this particular step in the whole process, one could prefer sharing a compiled regular expression (calling `compiledPattern.matcher(string).matches()`) rather than `string.matches(pattern)`. In even more intensive scenarios, one would even consider reusing matcher objects, and not only patterns, with the proper thread safety measures (e.g. reuse a matcher inside a spark `rdd.mapPartitions` call) : https://stackoverflow.com/questions/11391337/java-pattern-matcher-create-new-or-reset – GPI Sep 27 '17 at 12:16
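
A minimal sketch of the precompiled-pattern idea from the comment above, using `java.util.regex.Pattern` directly. The names `startsWithCode` and `filterWithSharedMatcher` are hypothetical. The second function has the `Iterator => Iterator` shape that a `mapPartitions` call expects, but it is shown here on a plain iterator and has not been measured against `String.matches`.

```scala
import java.util.regex.Pattern

// Compile once and share; Pattern instances are immutable and thread-safe.
val compiledPattern: Pattern = Pattern.compile("""\(\D\)(;|\s)\d{4}-.*""")

// A fresh Matcher per line; Matcher.matches() requires the whole input to match.
def startsWithCode(line: String): Boolean =
  compiledPattern.matcher(line).matches()

// Reuses one Matcher via reset. A Matcher is not thread-safe, so this is
// only appropriate when a single thread consumes the iterator.
def filterWithSharedMatcher(lines: Iterator[String]): Iterator[String] = {
  val matcher = compiledPattern.matcher("")
  lines.filter(line => matcher.reset(line).matches())
}
```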

user2811630 · Answer 2 · 2017-09-27T10:39:59.143

1

I found the answer to my question.

The command needs to look like this.

 val lineregex = """\(\D\)(;|\s)\d{4}-.*""".r

 val rawData = spark.read.textFile(file.path)
  .filter(f => f.nonEmpty && f.length > 1 && lineregex.unapplySeq(f).isDefined )
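
For comparison, a small sketch (with made-up sample lines) suggesting that `unapplySeq(...).isDefined`, a `case lineregex(_*)` pattern match, and a plain `String.matches` on the same pattern text select the same lines, since all three require the whole line to match. The helper name `viaExtractor` is hypothetical.

```scala
val lineregex = """\(\D\)(;|\s)\d{4}-.*""".r

// Hypothetical sample lines, including an empty and a one-character line.
val samples = List("(W);1234-a", "(W);12345-a", "(W) 9999-b", "", "1")

// Regex extractor: `case lineregex(_*)` succeeds only on a full match.
def viaExtractor(line: String): Boolean = line match {
  case lineregex(_*) => true
  case _             => false
}

val viaUnapplySeq = samples.filter(s => lineregex.unapplySeq(s).isDefined)
val viaMatches    = samples.filter(s => s.matches(lineregex.regex))
val viaPattern    = samples.filter(viaExtractor)
```

The empty and one-character samples are rejected by the regex itself in all three variants, without separate `nonEmpty` or `length` checks.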

edited Sep 27 '17 at 10:39

answered Sep 27 '17 at 09:12

user2811630



No, you don't. You can just use `f.matches(lineregex)` (the `nonEmpty` and `length` tests are redundant since if they fail, so will the regexp match). – The Archetypal Paul Sep 27 '17 at 09:16
Yeah code is redundant now, but why should this code not work correctly? – user2811630 Sep 27 '17 at 10:39
Didn't say it wouldn't work. I was commenting on the "needs to look like this". No, it doesn't - `matches` is simpler and clearer. – The Archetypal Paul Sep 27 '17 at 10:41
Okay, thanks. I am now using matches. I thought it did not work because I always put a .r at the end of lineregex, but that is not necessary. I am now running it with your matches command. Thanks. – user2811630 Sep 27 '17 at 10:44

score 0 · Answer 3 · answered Sep 27 '17 at 09:14

You can try to find a match of the Regex using the findFirstMatchIn method, which returns an Option[Match]:

spark.read.textFile(file.path).filter { line =>
  line.nonEmpty &&
  line.length > 1 &&
  "regex".r.findFirstMatchIn(line).isDefined
}
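
One caveat worth sketching: `findFirstMatchIn` looks for a match anywhere in the line, so to keep only lines that start with the pattern, the regex would need a leading `^`. The example below assumes the question's pattern (without the trailing `.*`) in place of the `"regex"` placeholder and uses made-up sample lines. It also builds each `Regex` once outside the predicate instead of calling `.r` for every line.

```scala
import scala.util.matching.Regex

// Compiled once, outside the predicate, rather than once per line.
val unanchored: Regex = """\(\D\)(;|\s)\d{4}-""".r
val anchored: Regex   = """^\(\D\)(;|\s)\d{4}-""".r

// Hypothetical sample lines; the second has the code in the middle of the line.
val candidates = List("(W);1234-a", "note (W);1234-a", "(W);12345-a")

// Matches anywhere in the line.
val anywhere = candidates.filter(l => unanchored.findFirstMatchIn(l).isDefined)

// Matches only at the start of the line because of the leading ^.
val atStart = candidates.filter(l => anchored.findFirstMatchIn(l).isDefined)
```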

Miguel

