
I have a text file that looks like this:

ABC gibberish
DEF gibberish
ABC text
DEF random

I only want to keep the lines that start with ABC. This is what I've tried:

val lines = sc.textFile("textfile.txt")
val reg = "^ABC".r
val abc_lines = lines.filter(x => reg.pattern.matcher(x).matches)
abc_lines.count()

The count returns 0, so nothing matches. Where did I go wrong?

Stanko

3 Answers

3

You don't need a regex for this; you can just use the startsWith method.

val abc_lines = lines.filter(x => x.startsWith("ABC"))
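
As a quick check without Spark, the same predicate can be tried on a plain Scala collection. This is only a minimal sketch: the sample list mirrors the four lines from the question and stands in for the RDD, and the object and value names are made up for illustration.

```scala
object StartsWithDemo {
  // Sample data copied from the question; stands in for sc.textFile(...)
  val lines: Seq[String] =
    Seq("ABC gibberish", "DEF gibberish", "ABC text", "DEF random")

  // Keep only the lines that begin with the literal prefix "ABC"
  val abcLines: Seq[String] = lines.filter(_.startsWith("ABC"))

  def main(args: Array[String]): Unit =
    abcLines.foreach(println)
}
```

Since filter on an RDD takes the same kind of predicate, the lambda should carry over to the Spark version unchanged.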
Ryan Widmaier
1

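Because the matches method does not do what you expect: it returns true only when the entire string matches the pattern, not when the pattern merely matches a prefix (see the documentation for java.util.regex.Matcher).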

You can try this snippet to see the behaviour:

val list = List("ABC", "DEF gibberish", "ABC text", "DEF random")
val reg = "^ABC".r
val lines: Seq[String] = list.filter(x => reg.pattern.matcher(x).matches)
println(lines.size) // prints 1: only the exact string "ABC" matches in full

Instead, you can use this code:

val list2 = List("ABC", "DEF gibberish", "ABC text", "DEF random")
val lines2: Seq[String] = list2.filter(reg.findFirstIn(_).isDefined)
println(lines2.size) // prints 2: "ABC" and "ABC text"

You can find more info here - Matching against a regular expression in Scala
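
To make the difference concrete, here is a small sketch that compares several ways of testing a line against the pattern, using a plain list rather than an RDD. Matcher.matches requires the whole string to match, Matcher.lookingAt only requires a match at the start, and findFirstIn searches for the pattern (the ^ anchor pins it to the beginning). The object and value names, and the alternative pattern "ABC.*", are illustrative and not taken from the answers.

```scala
object RegexMatchDemo {
  // Sample data copied from the question
  val lines: Seq[String] =
    Seq("ABC gibberish", "DEF gibberish", "ABC text", "DEF random")
  val reg = "^ABC".r

  // matches(): the entire line must match "^ABC", so no line passes
  val whole: Seq[String] =
    lines.filter(x => reg.pattern.matcher(x).matches())

  // lookingAt(): the pattern only has to match at the start of the line
  val prefix: Seq[String] =
    lines.filter(x => reg.pattern.matcher(x).lookingAt())

  // findFirstIn: search for the pattern; ^ restricts it to the start
  val found: Seq[String] =
    lines.filter(x => reg.findFirstIn(x).isDefined)

  // matches() also works once the pattern covers the rest of the line
  val wholeFixed: Seq[String] =
    lines.filter(x => "ABC.*".r.pattern.matcher(x).matches())

  def main(args: Array[String]): Unit = {
    println(s"matches:     ${whole.size}")
    println(s"lookingAt:   ${prefix.size}")
    println(s"findFirstIn: ${found.size}")
    println(s"ABC.*:       ${wholeFixed.size}")
  }
}
```

With the question's data, the matches variant keeps no lines, while the other three keep the two lines that start with ABC.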

1

You can use the findFirstIn method of Regex as follows:

val abc_lines = lines.filter(x => "^ABC".r.findFirstIn(x) == Some("ABC"))

which should give you the correct result.

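Defining the regex outside the closure, as in the following, can give you a Task not serializable error in Spark (typically when reg ends up as a field of an enclosing class that is not itself serializable):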

val reg = "^ABC".r
val abc_lines = lines.filter(x => reg.findFirstIn(x) == Some("ABC"))
Ramesh Maharjan