I want to get the size of a file in Spark without pouring it into an RDD. How can I do that? I don't want to pour it into an RDD since that would add overhead to my application, and my files are large.
- What do you mean by "pour into RDD"? As soon as you read a file, it creates an RDD, which is an abstraction, not a data structure with an actual size until it is collected. – OneCricketeer Oct 12 '16 at 22:27
- Can I just read this info from the file's properties, so I don't have to read all the contents into an RDD? Maybe using a simple Java API. I'm guessing that reading into an RDD and then estimating the size makes Spark read all the contents, and since my file is large, that's extra overhead. – mahdi62 Oct 12 '16 at 23:29
- Where is the file stored? HDFS or S3? You can read the file size directly from the respective Java API, yes. – OneCricketeer Oct 12 '16 at 23:34
- I wanted to know if there is anything in Spark itself. – mahdi62 Oct 12 '16 at 23:43
- You didn't answer the question. Where is the file stored? – OneCricketeer Oct 12 '16 at 23:44
- In an HDFS directory on the cluster. – mahdi62 Oct 13 '16 at 01:51
- And why can't you just use the Java HDFS API? http://stackoverflow.com/questions/8167153/how-to-get-file-size As I said, Spark isn't the solution, and RDDs don't have a "size". – OneCricketeer Oct 13 '16 at 01:56
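Following the last comment's suggestion, here is a minimal sketch in Scala of reading the size from HDFS metadata via the Hadoop FileSystem API, reusing the Hadoop configuration that Spark already carries. The path `/data/myfile.txt` and the app name are placeholders for illustration:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}

object HdfsFileSize {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hdfs-file-size"))

    // Reuse the Hadoop configuration Spark already holds, so this talks
    // to the same HDFS cluster as the rest of the job.
    val fs = FileSystem.get(sc.hadoopConfiguration)

    // Placeholder path; replace with the actual file on HDFS.
    val path = new Path("/data/myfile.txt")

    // getFileStatus is a pure NameNode metadata call: no file contents
    // are read and no RDD is created.
    val sizeInBytes = fs.getFileStatus(path).getLen
    println(s"File size: $sizeInBytes bytes")

    // For a whole directory, getContentSummary sums the sizes of all
    // files underneath it, again without reading any data.
    val dirBytes = fs.getContentSummary(new Path("/data")).getLength
    println(s"Directory total: $dirBytes bytes")

    sc.stop()
  }
}
```

Since this goes straight through the driver's Hadoop client, it works whether or not an RDD is ever created from the file, so none of the file's contents are ever loaded.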