40

I want to list all folders within an HDFS directory using Scala/Spark. In Hadoop I can do this with the command: `hadoop fs -ls hdfs://sandbox.hortonworks.com/demo/`

I tried it with:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val fs = FileSystem.get(new URI("hdfs://sandbox.hortonworks.com/"), conf)

val path = new Path("hdfs://sandbox.hortonworks.com/demo/")

val files = fs.listFiles(path, false)

But it does not seem to look in the Hadoop directory, as I cannot find my folders/files.

I also tried with:

FileSystem.get(sc.hadoopConfiguration).listFiles(new Path("hdfs://sandbox.hortonworks.com/demo/"), true)

But this also does not help.

Do you have any other idea?

PS: I also checked this thread: Spark iterate HDFS directory, but it does not work for me, as it does not seem to search the HDFS directory, only the local file system with the file:// scheme.

AlexL
  • This solution helped me with a bug. I needed to do code like `val fs = FileSystem.get(new URI("s3://mybucket/mykey"), conf)` to get the correct FileSystem for spark to use. The default FileSystem was for hdfs. – Don Smith Nov 12 '19 at 23:42

9 Answers

42

We are using Hadoop 1.4, which does not have a listFiles method, so we use listStatus to get the directories. It does not have a recursive option, but the recursive lookup is easy to manage yourself (see the sketch after the code below).

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val status = fs.listStatus(new Path(YOUR_HDFS_PATH))
status.foreach(x => println(x.getPath))
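
A minimal sketch of such a recursive lookup, assuming the same `fs` as above. It uses isDir to match the Hadoop 1.x API mentioned here; on newer Hadoop, isDirectory is the non-deprecated equivalent (see a later answer):

// Sketch only: recursively collect every path under `path` (isDir is the Hadoop 1.x API)
def listRecursive(fs: FileSystem, path: Path): Seq[Path] =
  fs.listStatus(path).toSeq.flatMap { status =>
    if (status.isDir) status.getPath +: listRecursive(fs, status.getPath)
    else Seq(status.getPath)
  }

listRecursive(fs, new Path(YOUR_HDFS_PATH)).foreach(println)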
nil
  • Thanks a lot, listStatus is much better for getting the folders and works nicely! In my case I don't need a recursive lookup, so that's perfectly fine. **One addition**: when I use your code, the filesystem scheme is file:// and I cannot use hdfs:// as the scheme. So I created the FileSystem this way: `val conf = new Configuration(); val fs = FileSystem.get(new URI("hdfs://sandbox.hortonworks.com/"), conf)`. Then the FileSystem accepts hdfs:// paths. – AlexL Oct 29 '15 at 08:21
  • "error: not found: type Configuration": how do I import or prepare it? Fixed by using `import org.apache.hadoop.conf.Configuration`. – Peter Krauss Sep 18 '19 at 15:41
20

In Spark 2.0+,

import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val hdfsPath = "/your/hdfs/path" // your path here
fs.listStatus(new Path(hdfsPath)).filter(_.isDir).map(_.getPath).foreach(println)

Hope this is helpful.

Alex Raj Kaliamoorthy
Ajay Ahuja
6

In Ajay Ahuja's answer, isDir is deprecated.

Use isDirectory instead. Please see the complete example and output below.

package examples

import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession

object ListHDFSDirectories extends App {
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)
  val spark = SparkSession.builder()
    .appName(this.getClass.getName)
    .config("spark.master", "local[*]").getOrCreate()

  val hdfspath = "." // your path here
  import org.apache.hadoop.fs.{FileSystem, Path}
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  fs.listStatus(new Path(s"${hdfspath}")).filter(_.isDirectory).map(_.getPath).foreach(println)
}

Result:

file:/Users/user/codebase/myproject/target
file:/Users/user/codebase/myproject/Rel
file:/Users/user/codebase/myproject/spark-warehouse
file:/Users/user/codebase/myproject/metastore_db
file:/Users/user/codebase/myproject/.idea
file:/Users/user/codebase/myproject/src
user3190018
5
import java.net.URI

val listStatus = org.apache.hadoop.fs.FileSystem.get(new URI(url), sc.hadoopConfiguration)
  .globStatus(new org.apache.hadoop.fs.Path(url))

for (urlStatus <- listStatus) {
  println("urlStatus get Path: " + urlStatus.getPath)
}
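
globStatus accepts glob patterns as well as plain paths, so the `url` above can contain wildcards. A hedged usage sketch; the host and pattern below are example values borrowed from the question, not from this answer:

// Example values: a glob matching every entry directly under /demo
val url = "hdfs://sandbox.hortonworks.com/demo/*"
val entries = org.apache.hadoop.fs.FileSystem.get(new java.net.URI(url), sc.hadoopConfiguration)
  .globStatus(new org.apache.hadoop.fs.Path(url))
entries.foreach(e => println(e.getPath))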
Nitin
4

I was looking for the same thing, but for S3 instead of HDFS.

I solved it by creating the FileSystem with my S3 path, as below:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

def getSubFolders(path: String)(implicit sparkContext: SparkContext): Seq[String] = {
  val hadoopConf = sparkContext.hadoopConfiguration
  val uri = new URI(path)

  FileSystem.get(uri, hadoopConf).listStatus(new Path(path)).map {
    _.getPath.toString
  }
}
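
A hypothetical call, assuming an existing SparkContext named `sc` (for example in spark-shell) and a made-up bucket and prefix:

// Hypothetical usage; the bucket and prefix are invented for illustration
implicit val ctx: SparkContext = sc
getSubFolders("s3://my-bucket/some/prefix/").foreach(println)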

I know this question was about HDFS, but maybe others like me will come here looking for an S3 solution. Without specifying the URI when creating the FileSystem, it will look for an HDFS one and fail with:

java.lang.IllegalArgumentException: Wrong FS: s3://<bucket>/dummy_path
expected: hdfs://<ip-machine>.eu-west-1.compute.internal:8020
Franzi
3
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Demo").getOrCreate()
val path = new Path("enter your directory path")
val fs: FileSystem = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
val it = fs.listLocatedStatus(path)

This will create an iterator `it` over org.apache.hadoop.fs.LocatedFileStatus, with one entry per file or subdirectory under the path.
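
Since listLocatedStatus returns a Hadoop RemoteIterator rather than a Scala Iterator, a small loop is needed to consume it. A minimal sketch, assuming the `it` from above:

// RemoteIterator exposes hasNext/next rather than Scala's Iterator API
while (it.hasNext) {
  val status = it.next()
  if (status.isDirectory) println(status.getPath)  // keep only subdirectories
}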

Lejla
1

Azure Blob Storage is mapped to an HDFS location, so all the standard Hadoop operations work on it.

On the Azure Portal, go to your Storage Account; you will find the following details:

  • Storage account

  • Key -

  • Container -

  • Path pattern – /users/accountsdata/

  • Date format – yyyy-mm-dd

  • Event serialization format – json

  • Format – line separated

The path pattern here is the HDFS path. You can log in (e.g. via PuTTY) to the Hadoop edge node and run:

hadoop fs -ls /users/accountsdata 

The above command will list all the files. In Scala you can use:

import scala.sys.process._ 

val lsResult = Seq("hadoop","fs","-ls","/users/accountsdata/").!!
g00glen00b
0
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HDFSProgram extends App {
  val uri = new URI("hdfs://HOSTNAME:PORT")
  val fs = FileSystem.get(uri, new Configuration())
  val filePath = new Path("/user/hive/")
  val status = fs.listStatus(filePath)
  status.map(sts => sts.getPath).foreach(println)
}

This is sample code to get the list of HDFS files or folders present under /user/hive/.

-2

Because you're using Scala, you may also be interested in the following:

import scala.sys.process._
val lsResult = Seq("hadoop","fs","-ls","hdfs://sandbox.hortonworks.com/demo/").!!

This will, unfortunately, return the entire output of the command as a single string, so parsing it down to just the filenames requires some effort. (Use fs.listStatus instead.) But if you find yourself needing to run other commands that you could do easily on the command line and are unsure how to do it in Scala, just shell out through scala.sys.process._. (Use a single `!` if you only want the return code.)
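
For example, a rough sketch of pulling just the paths out of that string; the column layout of `hadoop fs -ls` output is assumed here, so treat it as illustrative only:

// Rough parse: skip the "Found N items" header line, then take the last
// whitespace-separated column of each remaining line (the path)
val paths = lsResult
  .split("\n")
  .filter(line => line.startsWith("d") || line.startsWith("-"))
  .map(_.split("\\s+").last)
paths.foreach(println)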

Matthew Gray