1

I am a Python programmer. The Python API was too slow for my Spark application, so I decided to port my code to the Spark Scala API to compare the computation time.

I am trying to keep only the lines that start with a numeric character from a huge file, using the Scala API in Spark. In my file, some lines have numbers and some have words, and I want only the lines with numbers.

So, in my Python application, I have these lines.

l = sc.textFile("my_file_path")
l_filtered = l.filter(lambda s: s[0].isdigit())

which works exactly as I want.

This is what I have tried so far.

val l = sc.textFile("my_file_path")
val l_filtered = l.filter(x => x.forall(_.isDigit))

This throws an error saying that Char does not have a forall() function.

I also tried taking the first character of each line with x.take(1) and applying isDigit to it, as follows.

val l = sc.textFile("my_file_path")
val l_filtered = l.filter(x => x.take(1).isDigit)

and this too...

val l = sc.textFile("my_file_path")
val l_filtered = l.filter(x => x.take(1).Character.isDigit)

This also throws an error.

This is probably a small error, but as I am not accustomed to Scala syntax, I am having a hard time figuring it out. Any help would be appreciated.

Edit: As suggested in the answers to the linked question, I tried writing the function, but I am unable to use it in filter() in my application so that it is applied to every line in the file.

Shiva
  • possible duplicate of [How to check to see if a string is a decimal number in Scala](http://stackoverflow.com/questions/9938098/how-to-check-to-see-if-a-string-is-a-decimal-number-in-scala) – Justin Pihony Sep 25 '15 at 18:40
  • @JustinPihony I tried applying those answers and you can see them in the question description. I am trying to apply the function in the accepted answer in the link and I am unable to use it in filter() function – Shiva Sep 25 '15 at 18:47

2 Answers

6

In Scala, indexing uses parentheses () rather than brackets []. The exact translation of your Python code would be this:

val l = sc.textFile("my_file_path")
val l_filtered = l.filter(_(0).isDigit)

A more idiomatic way to extract the first character is the head method:

val l = sc.textFile("my_file_path")
val l_filtered = l.filter(_.head.isDigit)

Both of these methods would fail if your file contains empty lines.

If that's the case, then you probably want this:

val l = sc.textFile("my_file_path")
val l_filtered = l.filter(_.headOption.map(_.isDigit).getOrElse(false))

Update:

As curious noted, map(predicate).getOrElse(false) on an Option can be shortened to exists(predicate):

val l = sc.textFile("my_file_path")
val l_filtered = l.filter(_.headOption.exists(_.isDigit))
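To make the empty-line behaviour concrete, here is a minimal sketch that runs the three variants on a plain Scala List instead of an RDD, so no Spark setup is needed. The object name and the sample data are made up for illustration; only the filter expressions come from the answer.

```scala
import scala.util.Try

object EmptyLineBehaviour {
  // Hypothetical sample input: note the empty string in the middle.
  val sample: List[String] = List("1hello", "", "world", "42")

  // Indexing and head both throw on "", so each of these filters fails as a whole.
  val byIndex: Try[List[String]] = Try(sample.filter(_(0).isDigit))
  val byHead: Try[List[String]] = Try(sample.filter(_.head.isDigit))

  // headOption yields None for "", and None.exists(...) is false,
  // so the empty line is simply dropped.
  val byHeadOption: List[String] = sample.filter(_.headOption.exists(_.isDigit))
}
```

With this sample, the first two results are failures and only the headOption version returns the lines starting with a digit.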
Ihor Kaharlichenko
  • Thank you for detailed answer. Inspires me to do a lot of digging around into Scala. – Shiva Sep 25 '15 at 19:09
  • The first two examples are not safe if a line is blank: the first throws `StringIndexOutOfBoundsException` and the second throws `NoSuchElementException`. The third is safe, but instead of calling map on headOption and then `getOrElse`, you could simply use the `exists` method. – curious Sep 25 '15 at 19:56
  • Thanks for pointing out the `exists` method, I updated the answer to include it. The first two examples were intentionally error prone since the original code was susceptible to empty strings as well. I wanted to gradually _improve_. Also, in large scale (which is where Spark is applied I guess) you may get speed improvements by omitting redundant checks if you are certain about your input data format. – Ihor Kaharlichenko Sep 25 '15 at 20:12
2

You can use regular expressions:

scala> List("1hello","2world","good").filter(_.matches("^[0-9].*$"))
res0: List[String] = List(1hello, 2world)

Or you can do it like this, with fewer operations per line, which matters because the file might contain a huge number of lines to filter:

scala> List("1hello","world").filter(_.headOption.exists(_.isDigit))
res1: List[String] = List(1hello)

In your case, replace the List[String] with your RDD of lines, l.
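If you would rather pass a named function to filter, as the question's edit attempts, a sketch along these lines should work. The object and method names are hypothetical. It also separates "starts with a digit" from "consists only of digits", since the question mentions both. Note that forall is available on String, and that it returns true for an empty string, hence the nonEmpty guard.

```scala
object LineFilters {
  // True when the first character is a digit; empty lines are rejected.
  def startsWithDigit(line: String): Boolean =
    line.headOption.exists(_.isDigit)

  // True when the line is non-empty and every character is a digit.
  def isAllDigits(line: String): Boolean =
    line.nonEmpty && line.forall(_.isDigit)

  // Hypothetical sample data, filtered with each predicate.
  val sample: List[String] = List("1hello", "2world", "good", "", "2015")
  val startingWithDigit: List[String] = sample.filter(startsWithDigit)
  val digitsOnly: List[String] = sample.filter(isAllDigits)
}
```

Since RDD.filter takes a function from the element type to Boolean, the analogous call on the RDD from the question would be l.filter(LineFilters.startsWithDigit); that call is not exercised here because it needs a running SparkContext.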

curious