So I recently got a MacBook and wanted to get into learning Spark and Scala. I went through a few online guides on installing Scala, Hadoop, and Spark, and since I wanted to try a new IDE, I installed IntelliJ.
I've been running into this issue:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/03/19 18:07:38 INFO SparkContext: Running Spark version 2.3.0
18/03/19 18:07:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/03/19 18:07:38 INFO SparkContext: Submitted application: Spark Count
18/03/19 18:07:38 INFO SecurityManager: Changing view acls to: jeanmac
18/03/19 18:07:38 INFO SecurityManager: Changing modify acls to: jeanmac
18/03/19 18:07:38 INFO SecurityManager: Changing view acls groups to:
18/03/19 18:07:38 INFO SecurityManager: Changing modify acls groups to:
18/03/19 18:07:38 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jeanmac); groups with view permissions: Set(); users with modify permissions: Set(jeanmac); groups with modify permissions: Set()
18/03/19 18:07:39 INFO Utils: Successfully started service 'sparkDriver' on port 61094.
18/03/19 18:07:39 INFO SparkEnv: Registering MapOutputTracker
18/03/19 18:07:39 INFO SparkEnv: Registering BlockManagerMaster
18/03/19 18:07:39 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
18/03/19 18:07:39 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
18/03/19 18:07:39 INFO DiskBlockManager: Created local directory at /private/var/folders/r5/rfwd1cqd4kv8cmh5gh_qxpvm0000gn/T/blockmgr-c8a5c1ac-8e09-4352-928e-1169a96cd752
18/03/19 18:07:39 INFO MemoryStore: MemoryStore started with capacity 2004.6 MB
18/03/19 18:07:39 INFO SparkEnv: Registering OutputCommitCoordinator
18/03/19 18:07:39 INFO Utils: Successfully started service 'SparkUI' on port 4040.
18/03/19 18:07:39 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://jeanss-mbp:4040
18/03/19 18:07:39 INFO Executor: Starting executor ID driver on host localhost
18/03/19 18:07:39 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 61095.
18/03/19 18:07:39 INFO NettyBlockTransferService: Server created on jeanss-mbp:61095
18/03/19 18:07:39 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
18/03/19 18:07:39 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, jeanss-mbp, 61095, None)
18/03/19 18:07:39 INFO BlockManagerMasterEndpoint: Registering block manager jeanss-mbp:61095 with 2004.6 MB RAM, BlockManagerId(driver, jeanss-mbp, 61095, None)
18/03/19 18:07:39 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, jeanss-mbp, 61095, None)
18/03/19 18:07:39 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, jeanss-mbp, 61095, None)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at scala.ScalaWordCount$.main(ScalaWordCount.scala:10)
at scala.ScalaWordCount.main(ScalaWordCount.scala)
18/03/19 18:07:40 INFO SparkContext: Invoking stop() from shutdown hook
18/03/19 18:07:40 INFO SparkUI: Stopped Spark web UI at http://jeanss-mbp:4040
18/03/19 18:07:40 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/03/19 18:07:40 INFO MemoryStore: MemoryStore cleared
18/03/19 18:07:40 INFO BlockManager: BlockManager stopped
18/03/19 18:07:40 INFO BlockManagerMaster: BlockManagerMaster stopped
18/03/19 18:07:40 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/03/19 18:07:40 INFO SparkContext: Successfully stopped SparkContext
18/03/19 18:07:40 INFO ShutdownHookManager: Shutdown hook called
18/03/19 18:07:40 INFO ShutdownHookManager: Deleting directory /private/var/folders/r5/rfwd1cqd4kv8cmh5gh_qxpvm0000gn/T/spark-331c36a6-b985-4056-900e-88250052ebb3
From all this, what stuck out to me was this warning:
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
I did some googling and tried a few of the suggested fixes, but I still can't get my program to run.
Aside from that warning, I also looked at the exception itself:
18/03/19 18:07:39 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, jeanss-mbp, 61095, None)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at scala.ScalaWordCount$.main(ScalaWordCount.scala:10)
at scala.ScalaWordCount.main(ScalaWordCount.scala)
Line 10: val threshold = args(1).toInt
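Since Scala arrays are zero-indexed, args(1) is the second program argument, so I'm guessing the job isn't receiving two arguments when I run it from IntelliJ. Just to sketch what I mean (the object name and usage message here are made up, not part of my actual project), a guard like this would at least make the failure explicit:
object ScalaWordCountSketch {
  def main(args: Array[String]): Unit = {
    // args(0) should be the input file path and args(1) the threshold;
    // with fewer than two program arguments, args(1) throws ArrayIndexOutOfBoundsException: 1
    if (args.length < 2) {
      System.err.println("Usage: ScalaWordCount <input file> <threshold>")
      sys.exit(1)
    }
    val threshold = args(1).toInt
    println(s"threshold = $threshold")
    // ... the rest of the Spark job would go here, as in WordCount.scala below
  }
}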
So I'm asking for help on how to fix this issue correctly. I'll provide my system and IDE configuration below.
Versions:
- Scala: 2.12.4 (Java HotSpot(TM) 64-Bit Server VM, Java 9.0.4)
- Hadoop: 3.0.0
- Java: 9.0.4, Java(TM) SE Runtime Environment (build 9.0.4+11), Java HotSpot(TM) 64-Bit Server VM (build 9.0.4+11, mixed mode)
- Spark: 2.3.0, using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_162)
IntelliJ Configuration:
WordCount.scala:
package scala

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object ScalaWordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count").setMaster("local[2]"))
    val threshold = args(1).toInt

    // split each document into words
    val tokenized = sc.textFile(args(0)).flatMap(_.split(" "))

    // count the occurrence of each word
    val wordCounts = tokenized.map((_, 1)).reduceByKey(_ + _)

    // filter out words with less than threshold occurrences
    val filtered = wordCounts.filter(_._2 >= threshold)

    // count characters
    val charCounts = filtered.flatMap(_._1.toCharArray).map((_, 1)).reduceByKey(_ + _)

    System.out.println(charCounts.collect().mkString(", "))
  }
}
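For reference, as I read this code, it expects two program arguments: the input file path as args(0) and the word-count threshold as args(1), which would normally go in the run configuration's Program arguments field in IntelliJ.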
build.sbt:
name := "untitled" version := "0.1" scalaVersion := "2.11.8" val sparkVersion = "2.3.0" resolvers ++= Seq( "apache-snapshots" at "http://repository.apache.org/snapshots/" libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % sparkVersion, "org.apache.spark" %% "spark-sql" % sparkVersion, "org.apache.spark" %% "spark-mllib" % sparkVersion, "org.apache.spark" %% "spark-streaming" % sparkVersion, "org.apache.spark" %% "spark-hive" % sparkVersion )
I'm not too fluent in Scala and Spark, so it might just be something wrong in my code rather than my environment. If there are any troubleshooting steps you think I should take, let me know, and if you need any other configuration details, I'll be happy to update this post with them.