k-means Clustering geolocated data using Spark/Scala

Question

How can I handle geolocated data using the k-means clustering algorithm? Could somebody please share their input? Thanks in advance.

 Project_2_Dataset.txt file entries look like this 
 =================================================

            33.68947543 -117.5433083
            37.43210889 -121.4850296
            39.43789083 -120.9389785
            39.36351868 -119.4003347
            33.19135811 -116.4482426
            33.83435437 -117.3300009

 Please review my code here:
 ===========================
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.clustering.KMeans
    val data = sc.textFile("Project_2_Dataset.txt")             
    val parsedData = data.map( line => Vectors.dense(line.split(',').map(_.toDouble)))
    val kmmodel = KMeans.train(parsedData, 3, 5) // 3 clusters, at most 5 iterations
    17/06/17 13:12:20 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 2)
    java.lang.NumberFormatException: For input string: "33.68947543 -117.5433083"
            at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
            at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
            at java.lang.Double.parseDouble(Double.java:538)
            at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)

Thanks Amit K

score 0 · Answer 1 · answered Jun 18 '17 at 18:53

I think it is because you are splitting each line on ',' instead of ' '.

@ "33.19135811 -116.4482426".toDouble 
java.lang.NumberFormatException: For input string: "33.19135811 -116.4482426"
  ...

@ "33.19135811 -116.4482426".split(',').map(_.toDouble) 
java.lang.NumberFormatException: For input string: "33.19135811 -116.4482426"
  ...

@ "33.19135811 -116.4482426".split(' ').map(_.toDouble) 
res3: Array[Double] = Array(33.19135811, -116.4482426)

Dnomyar


Hi Dnomyar, yes, this works after following your suggestion to split on " ": "33.19135811 -116.4482426".split(' ').map(_.toDouble) gives res3: Array[Double] = Array(33.19135811, -116.4482426) – amitk Jun 19 '17 at 09:03
I have another question. Say we have a data set with multiple columns, something like this:
    2014-03-15:10:10:20 Sorrento 8cc3b47e-bd01-4482-b500-28f2342679af 33.68947543 -117.5433083
    2014-03-15:10:10:20 MeeToo ef8c7564-0a1a-4650-a655-c8bbd5f8f943 37.43210889 -121.4850296
    2014-03-15:10:10:20 MeeToo 23eba027-b95a-4729-9a4b-a3cca51c5548 39.43789083 -120.9389785
    2014-03-15:10:10:20 Sorrento 707daba1-5640-4d60-a6d9-1d6fa0645be0 39.36351868 -119.4003347
If I want to select only the latitude and longitude columns and apply the k-means model, how do I do that in Scala? – amitk Jun 19 '17 at 09:08
Maybe this can help you: https://stackoverflow.com/questions/6647166/how-do-i-pattern-match-arrays-in-scala – Dnomyar Jun 19 '17 at 17:55
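
Following the array pattern-matching idea from the linked question, picking out the last two columns might look like the sketch below. It assumes every record has exactly five whitespace-separated fields (timestamp, device, id, latitude, longitude), which is only a guess from the sample rows in the comment. `LatLonColumns` and `latLon` are made-up names for illustration.

```scala
object LatLonColumns {
  // Hypothetical helper. Assumes exactly five whitespace-separated fields:
  // timestamp, device, id, latitude, longitude.
  // Returns None when the field count differs; a non-numeric latitude or
  // longitude would still throw NumberFormatException.
  def latLon(line: String): Option[(Double, Double)] =
    line.trim.split("\\s+") match {
      case Array(_, _, _, lat, lon) => Some((lat.toDouble, lon.toDouble))
      case _                        => None
    }

  def main(args: Array[String]): Unit = {
    val records = Seq(
      "2014-03-15:10:10:20 Sorrento 8cc3b47e-bd01-4482-b500-28f2342679af 33.68947543 -117.5433083",
      "2014-03-15:10:10:20 MeeToo ef8c7564-0a1a-4650-a655-c8bbd5f8f943 37.43210889 -121.4850296"
    )
    // Keep only the records that matched the five-field layout.
    records.flatMap(r => latLon(r)).foreach { case (lat, lon) =>
      println(s"$lat, $lon")
    }
  }
}
```

In spark-shell, something along the lines of `data.flatMap(line => LatLonColumns.latLon(line)).map { case (lat, lon) => Vectors.dense(lat, lon) }` should then give an RDD of vectors to pass to `KMeans.train`. If a field such as the device name can itself contain spaces, the five-field assumption breaks and the pattern would need adjusting.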

amitk · Answer 2 · 2017-06-30T06:21:08.477

In the previous case we were able to apply the split to a single line of data ("33.19135811 -116.4482426".split(' ').map(_.toDouble)), but when I apply the same split to multiple lines of data, I get this error:

                33.68947543 -117.5433083
                37.43210889 -121.4850296
                39.43789083 -120.9389785
                39.36351868 -119.4003347

    scala> val kmmodel= KMeans.train(parsedData,3,5)
    17/06/29 19:14:36 ERROR Executor: Exception in task 1.0 in stage 6.0 (TID 8)
    java.lang.NumberFormatException: empty String
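
One possible cause, assuming the file contains blank lines, leading indentation, or more than one space between the two values: `split(' ')` then produces empty tokens, and `"".toDouble` throws `NumberFormatException: empty String`. A minimal sketch of a more tolerant parser follows. It trims each line, splits on runs of whitespace, and skips anything that does not yield exactly two numbers. `RobustParse` and `parseLine` are hypothetical names, not part of Spark.

```scala
import scala.util.Try

object RobustParse {
  // Hypothetical helper: returns Some(Array(lat, lon)) for a well-formed
  // line and None for blank or malformed lines instead of throwing.
  def parseLine(line: String): Option[Array[Double]] = {
    // trim removes leading/trailing whitespace; "\\s+" treats any run of
    // spaces or tabs as a single separator.
    val tokens = line.trim.split("\\s+")
    if (tokens.length == 2) Try(tokens.map(_.toDouble)).toOption
    else None
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "                33.68947543 -117.5433083", // leading indentation
      "",                                         // blank line
      "37.43210889   -121.4850296"                // several spaces
    )
    lines.flatMap(l => parseLine(l)).foreach(p => println(p.mkString(", ")))
  }
}
```

In spark-shell this could replace the original mapping, for example `val parsedData = data.flatMap(line => RobustParse.parseLine(line)).map(arr => Vectors.dense(arr))`, followed by the same `KMeans.train(parsedData, 3, 5)` call. Silently dropping bad lines is a design choice; counting or logging them may be preferable on real data.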
