
I am new to Hadoop and Spark. I am using Spark-2.1.1-bin-hadoop2.7, and with SparkR I want to load (read) data from Hadoop 2.7.3 HDFS.

I know I can point to my file using "hdfs://somepath-to-my-file", but I could not find a function in SparkR that does the job; read.df() doesn't work for me.

I am using sparkR.session() to connect to my Spark session. To launch the R interface for Spark, I ran sparkR from Spark's bin directory.

In short, I want to load a CSV file from HDFS using SparkR.

Please help. If possible, provide an example.

Thanks, SG

SGSO

3 Answers


Use a Spark DataFrame to load data from HDFS.

To convert a DataFrame to an RDD:

converted.rdd <- SparkR:::toRDD(dataframe)
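For context, a minimal end-to-end sketch (assuming an active sparkR.session(); faithful is a built-in R dataset):

# build a SparkDataFrame from a built-in R dataset
df <- as.DataFrame(faithful)
# toRDD() is internal (unexported) in SparkR, hence the ':::' accessor
converted.rdd <- SparkR:::toRDD(df)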
Arun Gunalan
  • Hi Arun, thanks for the reply. Maybe I am wrong, but I couldn't find the toRDD() function. I also visited the link you provided, but I was not able to execute those commands successfully. I also edited my question a tiny bit to make it clearer what I am looking for. – SGSO Aug 09 '17 at 03:10
  • I have answered in this link - https://stackoverflow.com/questions/44463640/reading-csv-data-into-sparkr-after-writing-it-out-from-a-dataframe/44523247#44523247 – Arun Gunalan Aug 09 '17 at 04:23
  • Hi Arun, I saw the other link you provided. Since I am using Spark 2.1.1, it does not seem to have sqlContext. I tried the exact steps from your link, except that I have Spark 2.1.1. Maybe I am missing something. Thanks. – SGSO Aug 09 '17 at 09:52
  • read.df since 1.4.0 & loadDF since 1.6.0. Kindly post what error you are getting. – Arun Gunalan Aug 10 '17 at 05:28
  • Hi @Arun, sorry for the delay. Because of some issues I had to switch to Spark 2.2.0. Here is the error I see when using read.df()/loadDF(). I could not grab the full screen: ------- imes; aborting job 17/08/11 10:17:02 ERROR r.RBackendHandler: loadDF on org.apache.spark.sql.api.r.SQLUtils failed java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) – SGSO Aug 11 '17 at 05:04
  • Don't know how to format code in comments here. In pyspark or scala, reading from HDFS works; see below (scala example):
    scala> val myFile="hdfs://localhost:9000/mydata/train.csv"
    myFile: String = hdfs://localhost:9000/mydata/train.csv
    scala> val txtfile = sc.textFile(myFile)
    txtfile: org.apache.spark.rdd.RDD[String] = hdfs://localhost:9000/mydata/train.csv MapPartitionsRDD[1] at textFile at <console>:26
    scala> txtfile.count()
    res0: Long = 892
    – SGSO Aug 11 '17 at 05:34

I cannot find a solution from the error alone; kindly share the code you executed (show fully how you launched SparkR and how you read the CSV file).

A workaround would be:

Option A:
# read the file from the local filesystem (not HDFS) using base R
R_dataframe <- read.csv(file="mydata/train.csv", header=TRUE, sep=",")
# Note: if the file is too big it won't fit, since this depends on RAM size;
# you can increase the driver memory while starting SparkR,
# e.g. ./bin/sparkR --driver-memory 12g

# convert the R data.frame to a SparkR DataFrame:
sparkR_dataframe <- as.DataFrame(R_dataframe)
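
A quick sanity check on the result (a sketch, assuming the session from Option A is still active):

# inspect the SparkR DataFrame created above
printSchema(sparkR_dataframe)
head(sparkR_dataframe)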

Option B:
Read the file in PySpark as an RDD from HDFS and save it as a Parquet file.
# in pyspark: read the CSV from HDFS as an RDD and split each line on commas
rdd = sc.textFile("hdfs://localhost:9000/mydata/train.csv").map(lambda line: line.split(","))
# convert the RDD to a DataFrame
df = rdd.toDF()
# save df as a Parquet file
df.write.parquet("train.parquet")
# then, in the SparkR session, read the file back:
train_df <- read.parquet("train.parquet")
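
For completeness, the reverse direction also works from SparkR; a sketch with an illustrative output path:

# write the SparkR DataFrame back out as Parquet
write.df(train_df, path="train_out.parquet", source="parquet", mode="overwrite")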
Arun Gunalan

The solution is to pass the "source" parameter to read.df() or loadDF(). Here are my steps:

# step 1: launch Hadoop standalone (using start-dfs.cmd and start-yarn.cmd)
# step 2: launch SparkR standalone just by typing sparkR
# step 3: run the following command to load a CSV file from HDFS
myfile <- read.df(path="hdfs://localhost:9000/mydata/train.csv", source="csv")

# step 4: print a few lines from myfile
head(myfile)
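
One caveat: with source="csv", Spark treats the first row as data by default. If train.csv has a header row, the usual CSV options can be passed straight through read.df(); a sketch assuming the same file:

# same read, but treat the first row as column names and infer column types
myfile <- read.df(path="hdfs://localhost:9000/mydata/train.csv",
                  source="csv", header="true", inferSchema="true")
printSchema(myfile)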

Note that in the code above, my train.csv file resides in the mydata directory in HDFS. read.df() and loadDF() worked equally well. The whole setup was on Windows 8.1 with the following:

  1. Hadoop 2.7.3 (standalone)
  2. Spark 2.2.0 (standalone)
  3. R 3.3.3 x64
  4. Java version 1.8.0_144

Thanks.

SGSO
  • #step2: launch spark standalone just by typing sparkR – `sparkR --packages com.databricks:spark-csv_2.10:1.0.3`. #step3: run following command to load a csv file from HDFS – `myfile <- read.df(sqlContext, "hdfs://localhost:9000/mydata/train.csv", "com.databricks.spark.csv", header="true")`. Have you tried this? – Arun Gunalan Aug 11 '17 at 13:23