OTC_omega_20210302.csv
CH_delta_20210302.csv
MD_omega_20210310.csv
CD_delta_20210310.csv

import org.apache.hadoop.fs.{FileSystem, Path}

val hdfsPath = "/development/staging/abcd-efgh"
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// List only the plain files (no directories) under hdfsPath
val files = fs.listStatus(new Path(hdfsPath)).filterNot(_.isDirectory).map(_.getPath)
// Current attempt at the pattern
val regX = "OTC_*[0-9].csv|CH_*[0-9].csv".r
val filteredFiles = files.filter(fName => regX.findFirstMatchIn(fName.getName).isDefined)

What regex do I need if I want any file name that starts with either OTC_ or CH_ and ends with YYYYMMDD.csv?

Given the files above, I need two outputs: OTC_omega_20210302.csv and CH_delta_20210302.csv.

Please help

Surender Raja

1 Answer


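You can use either of the following declarations. They define the same pattern; the second uses a triple-quoted string, so the backslash does not need to be doubled:

val regX = "^(?:OTC|CH)_.*[0-9]{8}\\.csv$".r
// or, equivalently
val regX = """^(?:OTC|CH)_.*[0-9]{8}\.csv$""".r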


Details:

  • ^ - start of string
  • (?:OTC|CH) - a non-capturing group matching either OTC or CH char sequences
  • _ - a _ char
  • .* - any zero or more chars other than line break chars, as many as possible
  • [0-9]{8} - eight digits
  • \. - a literal dot (note that an unescaped . matches any char other than a line break char, so you must escape it to match a literal dot)
  • csv - a csv string
  • $ - end of string.
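
As a quick check of the breakdown above, here is a minimal sketch that runs the pattern against the four file names from the question, without needing HDFS or Spark. The object name RegexDemo and the value names are only illustrative and not part of the original code:

```scala
import scala.util.matching.Regex

// Hypothetical demo object; only the pattern and file names come from the thread.
object RegexDemo {
  // File names listed in the question
  val names: Seq[String] = Seq(
    "OTC_omega_20210302.csv",
    "CH_delta_20210302.csv",
    "MD_omega_20210310.csv",
    "CD_delta_20210310.csv"
  )

  // Triple-quoted form: the backslash is written once
  val regX: Regex = """^(?:OTC|CH)_.*[0-9]{8}\.csv$""".r
  // Regular string form: the backslash must be doubled
  val regXEscaped: Regex = "^(?:OTC|CH)_.*[0-9]{8}\\.csv$".r

  // Keep only the names in which the anchored pattern finds a match
  val matched: Seq[String] = names.filter(n => regX.findFirstIn(n).isDefined)

  def main(args: Array[String]): Unit = {
    matched.foreach(println)
    // Both literals should compile to the same pattern text
    println(regX.regex == regXEscaped.regex)
  }
}
```

Note that because .* is greedy and unrestricted, this pattern would also accept a name with more than eight digits before .csv; if that matters, the part before the date could be tightened, for example by requiring a non-digit character just before [0-9]{8}.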
Wiktor Stribiżew