I am a beginner in the Spark world and want to run my machine learning algorithms using SparkR.

I installed Spark (1.6.1) in standalone mode on my laptop (Windows 7, 64-bit) and am able to run Spark, PySpark and SparkR on Windows following this helpful guide: link. Once SparkR was started, I began with the well-known flights example:

#Set proxy
Sys.setenv(http_proxy="http://user:password@proxy.companyname.es:8080/")
#Set SPARK_HOME
Sys.setenv(SPARK_HOME="C:/Users/amartinezsistac/spark-1.6.1-bin-hadoop2.4")
#Load SparkR and its library
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R", "lib"), .libPaths()))
library(SparkR)
#Set Spark Context and SQL Context
sc = sparkR.init(master="local")
sqlContext <- sparkRSQL.init(sc)
#Read Data
link <- "s3n://mortar-example-data/airline-data"
flights <- read.df(sqlContext, link, source = "com.databricks.spark.csv", header= "true")

Nevertheless, I receive the following error message after the last line:

Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
  java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
    at org.apache.spark.sql.api.r.SQLUtils$.loadDF(SQLUtils.scala:160)
    at org.apache.spark.sql.api.r.SQLUtils.loadDF(SQLUtils.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
    at org.apache.spark.api.r.RBackendHandler.ch

It seems the reason is that I do not have the spark-csv package installed, which can be downloaded from this page (GitHub link). Both on Stack Overflow and on the spark-packages.org website (link), the advice is to run $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.4.0, which is written for a Linux installation.

My question is: how can I run this command from the Windows 7 command prompt (cmd) in order to download this package?

I also tried an alternative solution suggested for my error message (GitHub), without success:

#In master you don't need spark-csv. 
#CSV data source is built into SparkSQL. Just use it as follows:
flights <- read.df(sqlContext, "out/data.txt", source = "com.databricks.spark.csv", delimiter="\t", header="true", inferSchema="true")
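
That suggestion seems to be aimed at a newer Spark build where the CSV source is built in, so on 1.6 the external package is presumably still needed. For reference, on a Spark 2.x build I understand the built-in source would be used roughly as in this sketch (not applicable to my 1.6 setup):

#Sketch for Spark 2.x only, where csv is built into Spark SQL and read.df no longer takes a sqlContext argument
flights <- read.df(link, source = "csv", header = "true", inferSchema = "true")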

Thanks in advance to everyone.


1 Answer

It is the same for Windows. When you start spark-shell from the bin directory, start it this way:

spark-shell --packages com.databricks:spark-csv_2.11:1.4.0
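
If you want to load the package from a SparkR session started inside R rather than from spark-shell, sparkR.init also accepts a sparkPackages argument (available in recent 1.x releases, as far as I know), which should download the same coordinate when the context is created. A minimal sketch, untested on Windows:

#Sketch: request spark-csv when creating the context; the package is resolved and cached on first use
library(SparkR)
sc <- sparkR.init(master = "local",
                  sparkPackages = "com.databricks:spark-csv_2.11:1.4.0")
sqlContext <- sparkRSQL.init(sc)
#link is the s3n path defined in the question
flights <- read.df(sqlContext, link, source = "com.databricks.spark.csv", header = "true")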
Daniel Zolnai
  • Hi Daniel. Thanks for your reply. In fact it started downloading the package, but then cmd showed a long message: "Unresolved dependencies. com.databricks:spark-csv_2.11:1.4.0 not found". Do you know why that could be? Thanks a lot. – NuValue May 03 '16 at 13:49
  • You could give the 2.10 version a try: `--packages com.databricks:spark-csv_2.10:1.4.0` – Daniel Zolnai May 03 '16 at 15:17
  • You probably need to set your proxy? See http://stackoverflow.com/questions/36676395/how-to-resolve-external-packages-with-spark-shell-when-behind-a-corporate-proxy – Boern Jun 16 '16 at 14:05
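
For reference, following the proxy suggestion in the last comment: the --packages resolution runs in the JVM launched by spark-submit, so the proxy has to reach that JVM rather than only the R session. From within R this could look roughly like the sketch below, using SPARKR_SUBMIT_ARGS (untested; the proxy host and port are the placeholders from the question):

#Untested sketch: pass --packages plus JVM proxy options to the SparkR backend; must run before sparkR.init
Sys.setenv("SPARKR_SUBMIT_ARGS" = paste(
  "--packages com.databricks:spark-csv_2.11:1.4.0",
  "--driver-java-options \"-Dhttp.proxyHost=proxy.companyname.es -Dhttp.proxyPort=8080\"",
  "sparkr-shell"))
library(SparkR)
sc <- sparkR.init(master = "local")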