Finding lines that start with a digit in Scala using filter() method

Question

I am a Python programmer. The Python API is too slow for my Spark application, so I decided to port my code to the Spark Scala API to compare the computation time.

I am trying to keep only the lines that start with a numeric character from a huge file, using the Scala API in Spark. In my file, some lines start with numbers and some with words, and I want only the lines that start with a number.

So, in my Python application, I have these lines.

l = sc.textFile("my_file_path")
l_filtered = l.filter(lambda s: s[0].isdigit())

which works exactly as I want.

This is what I have tried so far.

val l = sc.textFile("my_file_path")
val l_filtered = l.filter(x => x.forall(_.isDigit))

This throws an error saying that char does not have a forall() function.

I also tried taking the first character of each line using x.take(1) and applying the isDigit function to it, in the following way.

val l = sc.textFile("my_file_path")
val l_filtered = l.filter(x => x.take(1).isDigit)

and this too...

val l = sc.textFile("my_file_path")
val l_filtered = l.filter(x => x.take(1).Character.isDigit)

This also throws an error.

This is basically a small error, but as I am not accustomed to Scala syntax, I am having a hard time figuring it out. Any help would be appreciated.

Edit: As suggested in the answers to the linked question, I tried writing the function, but I am unable to use it in the filter() function in my application, that is, to apply it to all the lines in the file.

possible duplicate of [How to check to see if a string is a decimal number in Scala](http://stackoverflow.com/questions/9938098/how-to-check-to-see-if-a-string-is-a-decimal-number-in-scala) — Justin Pihony, Sep 25 '15 at 18:40
@JustinPihony I tried applying those answers and you can see them in the question description. I am trying to apply the function in the accepted answer in the link and I am unable to use it in filter() function — Shiva, Sep 25 '15 at 18:47

Ihor Kaharlichenko · Accepted Answer · 2015-09-25T20:09:31.780

6

In Scala, indexing uses parentheses () instead of brackets []. The exact translation of your Python code would be this:

val l = sc.textFile("my_file_path")
val l_filtered = l.filter(_(0).isDigit)

A more idiomatic way to extract the first character is the head method:

val l = sc.textFile("my_file_path")
val l_filtered = l.filter(_.head.isDigit)

Both of these methods would fail if your file contains empty lines.

If that's the case, then you probably want this:

val l = sc.textFile("my_file_path")
val l_filtered = l.filter(_.headOption.map(_.isDigit).getOrElse(false))
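
To make the difference between the three variants concrete, here is a minimal sketch that runs them against an empty string using plain Scala rather than an RDD. The object name `EmptyLineDemo` and the value names are made up for illustration; `scala.util.Try` is used only to capture the exceptions instead of crashing.

```scala
import scala.util.Try

object EmptyLineDemo {
  // A plain String stands in for one (empty) line of the file; no Spark is needed.
  val emptyLine = ""

  // Indexing into an empty string throws, so this Try is a Failure.
  val byIndex: Try[Boolean] = Try(emptyLine(0).isDigit)

  // head on an empty string throws as well, so this Try is also a Failure.
  val byHead: Try[Boolean] = Try(emptyLine.head.isDigit)

  // headOption yields None for an empty string, so the predicate falls back to false.
  val byHeadOption: Try[Boolean] =
    Try(emptyLine.headOption.map(_.isDigit).getOrElse(false))

  def main(args: Array[String]): Unit = {
    println(s"index:      $byIndex")
    println(s"head:       $byHead")
    println(s"headOption: $byHeadOption")
  }
}
```

In a Spark job, an exception thrown inside the filter function would fail the task rather than being caught like this, which is why the `headOption` form is the safer choice when empty lines are possible.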

UPD.

As curious noted, map(predicate).getOrElse(false) on an Option can be shortened to exists(predicate):

val l = sc.textFile("my_file_path")
val l_filtered = l.filter(_.headOption.exists(_.isDigit))
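
If you would rather write the check as a named function and pass it to filter (as the edit to the question attempts), a minimal sketch could look like the following. The names `StartsWithDigit` and `startsWithDigit` are hypothetical, and a plain `List` stands in for the RDD so that the example runs without Spark.

```scala
object StartsWithDigit {
  // Hypothetical helper: true only when the line is non-empty
  // and its first character is a digit.
  def startsWithDigit(line: String): Boolean =
    line.headOption.exists(_.isDigit)

  def main(args: Array[String]): Unit = {
    // Sample data standing in for the lines of the file, including an empty line.
    val lines = List("1hello", "2world", "good", "")

    // A method can be passed wherever a String => Boolean function is expected.
    val filtered = lines.filter(startsWithDigit)

    println(filtered) // List(1hello, 2world)
  }
}
```

Since `filter` on an `RDD[String]` also takes a `String => Boolean`, `l.filter(StartsWithDigit.startsWithDigit)` should be the analogous call on the RDD from the question.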

edited Sep 25 '15 at 20:09

answered Sep 25 '15 at 19:01


Thank you for detailed answer. Inspires me to do a lot of digging around into Scala. – Shiva Sep 25 '15 at 19:09

The first two examples are not safe if you have a blank string: the first one will throw `StringIndexOutOfBoundsException` and the second `NoSuchElementException`. The third one is safe and good, but instead of doing map on headOption and then `getOrElse`, you could simply use the `exists` method. – curious Sep 25 '15 at 19:56
Thanks for pointing out the `exists` method, I updated the answer to include it. The first two examples were intentionally error prone since the original code was susceptible to empty strings as well. I wanted to gradually _improve_. Also, in large scale (which is where Spark is applied I guess) you may get speed improvements by omitting redundant checks if you are certain about your input data format. – Ihor Kaharlichenko Sep 25 '15 at 20:12

score 2 · Answer 2 · answered Sep 25 '15 at 19:33

You can use regular expressions:

scala> List("1hello","2world","good").filter(_.matches("^[0-9].*$"))
res0: List[String] = List(1hello, 2world)

Or you can do it like this, with fewer operations, which matters because the file might contain a huge number of lines to filter:

scala> List("1hello","world").filter(_.headOption.exists(_.isDigit))
res1: List[String] = List(1hello)

In your case, replace the List[String] with your RDD of lines, l.
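
As a rough sketch of how the two approaches compare, the example below puts a regex-based predicate next to the `headOption` one on the same sample data. The object and method names are invented for illustration, and the pattern is compiled once with `java.util.regex.Pattern` instead of calling `String.matches`, which compiles the expression on every call.

```scala
import java.util.regex.Pattern

object RegexPrefixDemo {
  // Compiled once and reused for every line.
  private val leadingDigit: Pattern = Pattern.compile("^[0-9]")

  // find() searches the input, but ^ pins the match to the start of the line.
  def startsWithDigitRegex(line: String): Boolean =
    leadingDigit.matcher(line).find()

  // Same check without a regex: look at the first character, if there is one.
  def startsWithDigitHead(line: String): Boolean =
    line.headOption.exists(_.isDigit)

  // Sample data standing in for the lines of the file.
  val sample: List[String] = List("1hello", "2world", "good", "")

  def main(args: Array[String]): Unit = {
    println(sample.filter(startsWithDigitRegex)) // List(1hello, 2world)
    println(sample.filter(startsWithDigitHead))  // List(1hello, 2world)
  }
}
```

One difference to keep in mind: `[0-9]` matches only ASCII digits, while `isDigit` delegates to `Character.isDigit` and therefore also accepts digits from other Unicode scripts. For plain ASCII input the two predicates select the same lines.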
