3

I have a gcs folder as below:

gs://<bucket-name>/<folder-name>/dt=2017-12-01/part-0000.tsv
                                /dt=2017-12-02/part-0000.tsv
                                /dt=2017-12-03/part-0000.tsv
                                /dt=2017-12-04/part-0000.tsv
                                ...

I want to match only the files under dt=2017-12-02 and dt=2017-12-03 using sc.textFile() in Scio, which uses TextIO.Read.from() underneath as far as I know.

I've tried

gs://<bucket-name>/<folder-name>/dt={2017-12-02,2017-12-03}/*.tsv

and

gs://<bucket-name>/<folder-name>/dt=2017-12-(02|03)/*.tsv

Both match zero files:

INFO org.apache.beam.sdk.io.FileBasedSource - Filepattern gs://<bucket-name>/<folder-name>/dt={2017-12-02,2017-12-03}/*.tsv matched 0 files with total size 0

INFO org.apache.beam.sdk.io.FileBasedSource - Filepattern gs://<bucket-name>/<folder-name>/dt=2017-12-(02|03)/*.tsv matched 0 files with total size 0

What should be the valid filepattern on doing this?

pishen
  • 319
  • 3
  • 16

2 Answers2

1

You need to use the TextIO.readAll() transform that reads a PCollection<String> of filepatterns. Create the collection of filepatterns either explicitly via Create.of() or you can compute it using a ParDo.

case class ReadPaths(paths: java.lang.Iterable[String]) extends PTransform[PBegin, PCollection[String]] {
  override def expand(input: PBegin) = {
    Create.of(paths).expand(input).apply(TextIO.readAll())
  }
}

val paths = Seq(
  "gs://<bucket-name>/<folder-name>/dt=2017-07-01/part-0000.tsv",
  "gs://<bucket-name>/<folder-name>/dt=2017-12-20/part-0000.tsv",
  "gs://<bucket-name>/<folder-name>/dt=2018-03-29/part-0000.tsv",
  "gs://<bucket-name>/<folder-name>/dt=2018-05-04/part-0000.tsv"
)

import scala.collection.JavaConverters._

sc.customInput("Read Paths", ReadPaths(paths.asJava))
pishen
  • 319
  • 3
  • 16
jkff
  • 17,623
  • 5
  • 53
  • 85
1

This might work:

gs://bucket/folder/dt=2017-12-0[12]/*.tsv

Reference: https://cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNames

Andrew Nguonly
  • 2,258
  • 1
  • 17
  • 23
  • I have not tried it myself but this seems the right syntax for what @pishen is attempting. – Mar Cial R Feb 17 '18 at 08:29
  • hmmm, sorry I didn't describe it well. I know `[]` can handle single char, but what I want is a general OR syntax. For example, if I want to match `2017-12-01` and `2017-12-20`, I can't use `[]` to match it. – pishen Feb 21 '18 at 10:09
  • 1
    I don't believe there is any OR functionality in the URI wildcard syntax for GCS. You'll have to be explicit about specifying full GCS directories. – Andrew Nguonly Feb 21 '18 at 15:31