I was wondering if SparkR makes it easier to merge large data sets than "regular" R? I have 12 CSV files, each approximately 500,000 rows by 40 columns, containing monthly data for 2014, and I want to combine them into one file for the year. The files all have the same column labels, and I want to merge them on the first column (year); however, some files have more rows than others.
When I ran the following code:
setwd("C:\\Users\\Anonymous\\Desktop\\Data 2014")
file_list <- list.files()
for (file in file_list) {
  # if the merged dataset doesn't exist yet, create it from the first file
  if (!exists("dataset")) {
    dataset <- read.table(file, header = TRUE, sep = "\t")
  } else {
    # otherwise append the current file to the merged dataset
    temp_dataset <- read.table(file, header = TRUE, sep = "\t")
    dataset <- rbind(dataset, temp_dataset)
    rm(temp_dataset)
  }
}
R crashed.
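I assume the crash is mainly a memory problem, since rbind inside the loop copies the growing data frame on every iteration. For reference, a version that reads each file once and stacks everything in a single call would look roughly like this (untested sketch, still assuming the files are tab-delimited):
# read each monthly file once, then stack all of them row-wise
file_list <- list.files("C:\\Users\\Anonymous\\Desktop\\Data 2014", full.names = TRUE)
monthly <- lapply(file_list, read.table, header = TRUE, sep = "\t")
dataset <- do.call(rbind, monthly)
but presumably this still has to hold the full year in memory, so it may not help.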
When I ran this code:
library(SparkR)
library(magrittr)

# setwd("C:\\Users\\Anonymous\\Desktop\\Data 2014")
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

Jan2014_file_path <- file.path("C:\\Users\\Anonymous\\Desktop\\Data 2014",
                               "Jan2014.csv")
system.time(
  housing_a_df <- read.df(sqlContext,
                          Jan2014_file_path,
                          header = "true",
                          inferSchema = "false")
)
I got the following error:
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost):
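My guess is that read.df also needs an explicit source for CSV, since (as far as I understand) the default data source in Spark 1.x is Parquet. Something along these lines might be required (untested sketch; assumes the com.databricks spark-csv package can be pulled in through sparkR.init, and the version number is only an example):
# start SparkR with the spark-csv package available (package version is a guess)
sc <- sparkR.init(master = "local",
                  sparkPackages = "com.databricks:spark-csv_2.10:1.2.0")
sqlContext <- sparkRSQL.init(sc)

# tell read.df explicitly that the file is CSV
housing_a_df <- read.df(sqlContext,
                        "C:\\Users\\Anonymous\\Desktop\\Data 2014\\Jan2014.csv",
                        source = "com.databricks.spark.csv",
                        header = "true",
                        inferSchema = "false")
but I have not been able to confirm that this is what causes the error above.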
So, what would be an easy way to merge these files in SparkR?
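The direction I am hoping for is to read each month into a Spark DataFrame and union them, roughly like this (untested sketch; assumes the CSV reading above works and that all 12 files share the same schema):
# list the 12 monthly CSV files
files <- list.files("C:\\Users\\Anonymous\\Desktop\\Data 2014",
                    pattern = "\\.csv$", full.names = TRUE)

# read each month into a Spark DataFrame
monthly_dfs <- lapply(files, function(f) {
  read.df(sqlContext, f,
          source = "com.databricks.spark.csv",
          header = "true",
          inferSchema = "false")
})

# stack the monthly DataFrames row-wise into one DataFrame for 2014
year2014_df <- Reduce(unionAll, monthly_dfs)
count(year2014_df)  # sanity check on the combined row count
Is something along these lines the intended pattern, or is there a simpler way?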