
I'm trying to read Parquet files in SparkR installed on Windows. When I issue the following command:

all_tweets <- collect(read.parquet(sqlContext,"hdfs://localhost:9000/orcladv/internet/rawtweets"))

I get this error:

Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : java.lang.AssertionError: assertion failed: No predefined schema found, and no Parquet data files or summary files found under file:/C:/Users/xxxxx/Documents/hdfs:/localhost:9000/orcladv/internet/rawtweets.

    at scala.Predef$.assert(Predef.scala:179)

I'm not sure why Spark prefixes the path with file:/C:/Users/... when the URI I supplied starts with hdfs://localhost:9000.

Please help..

Thanks

Bala

  • Unrelated, but maybe useful to you ... be cautious about using collect(). Try to get the data directly from the Parquet file into a Spark DataFrame. By using collect(), you have blown through the Spark DataFrame and created an R data.frame, which will be processed in a single thread and does not use Spark's distributed processing power. Conversely, if you remove collect() from your code, you will have a Spark DataFrame, which will be processed multi-threaded across the cluster. If your data set is small enough to be handled in an R data.frame, why bother using Spark? – SpiritusPrana Jul 09 '16 at 14:43
  • Thanks SpiritusPrana. Yes, I understand the significance of using collect(). For the business logic in this case I can certainly do away with collect(), since the logic works on individual rows: it does not need information from any other row, so the rows are mutually independent. However, with SparkR we often have to operate on a serial flow of data, depending on the kind of analytical process being run. Thanks again for the caution; in this case I can definitely remove collect(), as in the sketch below. – Balaji Krishnan Jul 11 '16 at 08:55
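
For illustration, here is a minimal sketch of the same read without collect(), using the SparkR 1.6 API from the question. The text column and the 140-character filter are hypothetical stand-ins for the actual business logic:

# Read straight into a Spark DataFrame; no collect(), so nothing is pulled into R yet.
tweets <- read.parquet(sqlContext, "hdfs://localhost:9000/orcladv/internet/rawtweets")

# Row-independent transformations run distributed across the cluster.
# "text" is a hypothetical column name, not taken from the original data.
longTweets <- filter(tweets, length(tweets$text) > 140)
withLen <- withColumn(longTweets, "textLen", length(longTweets$text))

# Bring back only a small sample for inspection instead of the whole data set.
head(withLen)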

1 Answer


Does this post help? It seems related, and offers some clues on how to find the correct HDFS path.

Change "localhost" to the value in fs.defaultFS in your core-site.xml file.

If the HDFS path is invalid, Spark seems to fall back to looking in the local file system, which is why the error shows a file:/C:/Users/... prefix.
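
For example, if core-site.xml sets fs.defaultFS (fs.default.name on older Hadoop versions) to hdfs://0.0.0.0:19000, the call would look like the sketch below. The host and port are assumptions for illustration; substitute whatever your own core-site.xml actually says:

# Hypothetical host/port copied from fs.defaultFS in core-site.xml;
# replace hdfs://0.0.0.0:19000 with the value from your configuration.
all_tweets <- read.parquet(sqlContext, "hdfs://0.0.0.0:19000/orcladv/internet/rawtweets")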

  • I tried this as well, with no luck. What is surprising: (a) the same code works in a Linux-based Hadoop/Spark environment, and (b) where does Spark get the path that it prefixes to the hdfs:// URI? – Balaji Krishnan Jul 11 '16 at 08:58
  • all_tweets <- read.parquet(sqlContext,"hdfs:///localhost:9000//orcla
    16/07/12 00:57:30 INFO parquet.ParquetRelation: Listing hdfs://localhost:9000/C:/installs/spark-1.6.1-bin-hadoop2.6/bin/hdfs:/localhost:9000/orcladv/internet/rawtweets on driver
    16/07/12 00:57:30 ERROR r.RBackendHandler: parquet on 5 failed
    Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : java.lang.AssertionError: assertion failed: No predefined schema found, and no Parquet data .. under hdfs://localhost:9000/C:/installs/spark-1.6.1-bin-hadoop2.6/bin/hdfs:/localhost:9000/orcladv/internet/rawtweets – Balaji Krishnan Jul 12 '16 at 07:12