
I want to get the size of a file in Spark without loading it into an RDD. How can I do that? I don't want to load it into an RDD because my files are large, and reading them would add unnecessary overhead to my application.

mahdi62
  • What do you mean by "load into an RDD"? As soon as you read a file, it creates an RDD, which is an abstraction, not a data structure with an actual size until it is collected – OneCricketeer Oct 12 '16 at 22:27
  • Can I just read this info from the file properties so I don't have to read all the contents into an RDD? Maybe using a simple Java API? I'm guessing that reading into an RDD and then estimating the size makes Spark read all the contents, and since my file is large that is extra overhead – mahdi62 Oct 12 '16 at 23:29
  • Where is the file stored? HDFS or S3? You can read the file size directly from those respective Java API, yes – OneCricketeer Oct 12 '16 at 23:34
  • I wanted to know if there is anything in Spark itself? – mahdi62 Oct 12 '16 at 23:43
  • You didn't answer the question. Where is the file stored? – OneCricketeer Oct 12 '16 at 23:44
  • In an HDFS directory on the cluster – mahdi62 Oct 13 '16 at 01:51
  • And why can't you just use the Java HDFS API? http://stackoverflow.com/questions/8167153/how-to-get-file-size As I said, Spark isn't the solution here, and RDDs don't have a "size" – OneCricketeer Oct 13 '16 at 01:56
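
For reference, a minimal sketch of the Hadoop FileSystem approach suggested in the comments above, assuming a Spark application running against HDFS; the paths are hypothetical placeholders. The size comes from NameNode metadata, so nothing is loaded into an RDD:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}

object FileSizeExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("file-size"))

    // Reuse the Hadoop configuration Spark is already carrying, so the
    // FileSystem client talks to the same HDFS cluster as the application.
    val fs = FileSystem.get(sc.hadoopConfiguration)

    // Single file: getFileStatus is a NameNode metadata lookup only;
    // no file contents are transferred. (Hypothetical path.)
    val fileSize = fs.getFileStatus(new Path("/user/mahdi/data.txt")).getLen()

    // Whole directory: getContentSummary returns the total length in bytes
    // of everything under the path. (Hypothetical path.)
    val dirSize = fs.getContentSummary(new Path("/user/mahdi")).getLength()

    println(s"file: $fileSize bytes, directory: $dirSize bytes")
    sc.stop()
  }
}
```

The same calls work from plain Java without a SparkContext by building a `Configuration` directly, which avoids starting Spark at all if the size is the only thing needed.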
