The purpose is to manipulate each data file and save a copy to a second location in HDFS. I will be using

RddName.coalesce(1).saveAsTextFile(pathName)

to save the result to HDFS.

This is why I want to process each file separately, even though I am sure the performance will not be as efficient. However, I have yet to determine how to store the list of CSV file paths in an array of strings and then loop through each one with a separate RDD.
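
To make the goal concrete, here is the shape of the loop I am after; building the array of paths is the missing piece, and the destination path below is only a placeholder:

val csvPaths: Array[String] = ???  // <- the part I have not figured out

csvPaths.foreach { srcPath =>
  val fileRdd = sc.textFile(srcPath)
  // ... manipulate the data here ...
  fileRdd.coalesce(1).saveAsTextFile("/<target_location>/" + srcPath.split("/").last)
}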

Let us use the following anonymized example HDFS source locations:

/data/email/click/date=2015-01-01/sent_20150101.csv
/data/email/click/date=2015-01-02/sent_20150102.csv
/data/email/click/date=2015-01-03/sent_20150103.csv

I know how to list the file paths using Hadoop FS Shell:

hdfs dfs -ls /data/email/click/*/*.csv

I know how to create one RDD for all the data:

val sentRdd = sc.textFile( "/data/email/click/*/*.csv" )
– Jaime

3 Answers


I haven't tested it thoroughly, but something like this seems to work:

import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.hadoop.fs.{FileSystem, Path, LocatedFileStatus, RemoteIterator}
import java.net.URI

val path: String = ???

// Build a Hadoop configuration from the Spark configuration and open the filesystem.
val hconf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
val hdfs = FileSystem.get(hconf)
// The second argument toggles recursion; false lists only the given directory.
val iter = hdfs.listFiles(new Path(path), false)

// Drain the RemoteIterator into a List of file URIs.
def listFiles(iter: RemoteIterator[LocatedFileStatus]): List[URI] = {
  @annotation.tailrec
  def go(iter: RemoteIterator[LocatedFileStatus], acc: List[URI]): List[URI] = {
    if (iter.hasNext) {
      val uri = iter.next.getPath.toUri
      go(iter, uri :: acc)
    } else {
      acc
    }
  }
  go(iter, List.empty[URI])
}

listFiles(iter).filter(_.toString.endsWith(".csv"))
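
The URIs can then drive the per-file processing from the question. A rough, untested sketch (note that a RemoteIterator can only be traversed once, so the listing is rebuilt here; the destination path is only a placeholder):

val csvFiles = listFiles(hdfs.listFiles(new Path(path), false))
  .filter(_.toString.endsWith(".csv"))

csvFiles.foreach { uri =>
  val fileRdd = sc.textFile(uri.toString)
  // ... manipulate the data here ...
  fileRdd.coalesce(1).saveAsTextFile("/<target_location>/" + new Path(uri).getName)
}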
– zero323

This is what ultimately worked for me:

import org.apache.hadoop.fs._
import org.apache.spark.deploy.SparkHadoopUtil
import java.net.URI

val hdfs_conf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
val hdfs = FileSystem.get(hdfs_conf)
// source data in HDFS
val sourcePath = new Path("/<source_location>/<filename_pattern>")

// globStatus expands the wildcard pattern into the matching FileStatus entries
hdfs.globStatus(sourcePath).foreach { fileStatus =>
  val filePathName = fileStatus.getPath().toString()
  val fileName = fileStatus.getPath().getName()

  // <DO STUFF HERE>

} // end foreach loop
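
For example, the body of the loop can read, transform, and save each file, along the lines of the coalesce(1).saveAsTextFile approach from the question (the destination path is only a placeholder):

// inside the foreach loop above:
val fileRdd = sc.textFile(filePathName)
// ... manipulate the data here ...
fileRdd.coalesce(1).saveAsTextFile("/<target_location>/" + fileName)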
– Jaime

sc.wholeTextFiles(path) should help. It gives an RDD of (filepath, filecontent) pairs.
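
If you only need the paths, you can collect the keys, e.g.:

val filePaths: Array[String] =
  sc.wholeTextFiles("/data/email/click/*/*.csv").keys.collect()

Keep in mind this still materializes the file contents under the hood, since each record is an entire file.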

– Huy Banh
    Wouldn't that be using the data though? I just want to iterate through each filepath within. – Jaime Sep 25 '15 at 15:23