
I have approximately 1 million text files stored in S3. I want to rename all the files based on their folder names.

How can I do that in Spark/Scala?

I am looking for some sample code.

I am using Zeppelin to run my Spark script.

Below is the code I have tried, as suggested in the answer:

import org.apache.hadoop.fs._

val src = new Path("s3://trfsmallfffile/FinancialLineItem/MAIN")
val dest = new Path("s3://trfsmallfffile/FinancialLineItem/MAIN/dest")
val conf = sc.hadoopConfiguration   // assuming sc = spark context
val fs = Path.getFileSystem(conf)
fs.rename(src, dest)

But I am getting the below error:

<console>:110: error: value getFileSystem is not a member of object org.apache.hadoop.fs.Path
       val fs = Path.getFileSystem(conf)

1 Answer

You can use the normal HDFS APIs, something like this (typed in, not tested):

val src = new Path("s3a://bucket/data/src")
val dest = new Path("s3a://bucket/data/dest")
val conf = sc.hadoopConfiguration   // assuming sc = spark context
val fs = src.getFileSystem(conf)
fs.rename(src, dest)

The way the S3A client fakes a rename is a copy + delete of every file, so the time it takes is proportional to the number of files and the amount of data. And S3 throttles you: if you try to do this in parallel, it will potentially slow you down. Don't be surprised if it takes "a while".
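Since every file gets copied individually anyway, the per-file renaming logic (the question's actual goal, deriving the new name from the folder name) can be worked out as pure string manipulation before issuing any copies. A minimal sketch, with a hypothetical naming scheme (prefix each file name with its parent folder's name); adapt the scheme and paths to your actual layout:

```scala
// Sketch only: derive a destination key from a source key by prefixing
// the file name with its parent folder's name. The naming scheme is a
// hypothetical example, not from the original question.
// Assumes the key has at least one folder component (e.g. "folder/file").
def renamedKey(srcKey: String): String = {
  val parts    = srcKey.split("/")
  val fileName = parts.last                  // e.g. "part-00000.txt"
  val folder   = parts(parts.length - 2)     // immediate parent folder name
  (parts.dropRight(1) :+ s"${folder}_${fileName}").mkString("/")
}
```

For example, `renamedKey("FinancialLineItem/MAIN/part-00000.txt")` yields `"FinancialLineItem/MAIN/MAIN_part-00000.txt"`; you would then call `fs.rename(src, dest)` per file with the old and new `Path`s.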

You also get billed per COPY call, at $0.005 per 1,000 calls, so it will cost you ~$5 to try. Test on a small directory until you are sure everything is working.
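The ~$5 figure follows directly from the question's file count; a quick back-of-the-envelope check (the $0.005 per 1,000 COPY requests rate is the one quoted above):

```scala
// Rough cost estimate: one COPY request per file renamed.
val files            = 1000000          // ~1 million files, per the question
val costPerThousand  = 0.005            // dollars per 1,000 COPY calls (quoted above)
val estimatedDollars = files / 1000.0 * costPerThousand   // ≈ $5
```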

  • Tried it just now but getting an error; I've updated my question, please have a look – Atharv Thakur Jan 12 '18 at 10:04
  • OK, found an error in my code. That said, you are going to need the entirety of the Hadoop & Spark source trees in your IDE if you are doing serious work at this level. Be warned, and start practicing this early – stevel Jan 15 '18 at 12:37
  • Already raised two bounties for this requirement. My colleague has raised one active bounty also: https://stackoverflow.com/questions/46703623/how-to-rename-spark-data-frame-output-file-in-aws-in-spark-scala ... – Atharv Thakur Jan 15 '18 at 12:59
  • I've fixed my code; you should use `src.getFileSystem()` — it's a non-abstract method. As I warned, not typed, tested. – stevel Jan 16 '18 at 11:11
  • Yes it is, I upvoted for that. Thank you so much. But just one more thing: I have so many files inside the src folders that I want to rename and move to another folder. My colleague has created a separate question for that; can you please look at that question? We have a 100-point bounty on that too. If you can help, that would be great – Atharv Thakur Jan 16 '18 at 11:34
  • added new https://stackoverflow.com/questions/48280879/rename-and-move-s3-files-based-on-their-folders-name-in-spark-scala – Atharv Thakur Jan 16 '18 at 16:02
  • Sorry, you are trying to learn how to write basic Hadoop FS API code on stack overflow. That's not what you should be doing. Get an IDE, look at the documentation, write some tests cases to do what you are trying to do, evolve them. – stevel Jan 16 '18 at 16:29
  • Hi Atharv, I am also working on similar issue, did you get the solution.? – TheCodeCache Sep 23 '18 at 08:11