3

Is there a way to create a column with filename of the source that created each row ?

Use-Case: I would like to track which file in a GCS bucket resulted in the creation of which row in the resulting dataset. I would like a scheduled transformation of the files contained in a specific GCS bucket.

I've looked at the "metadata article" on GCP but it is pretty useless for my use-case.

UPDATED: I have opened a feature request with Google.

jldupont
  • 93,734
  • 56
  • 203
  • 318

1 Answers1

0

While they haven't closed that issue yet, this was part of the update last week.

There's now a source metadata reference called $filepath—which, as you would expect, stores the local path to the file in Cloud Storage (starting at the top-level bucket). You can use this in formulas or add it to a new formula column and then do anything you want in additional recipe steps.

There are some caveats, such as it not returning a value for BigQuery sources and not persisting through pivot, join, or unnest . . . but it covers the vast majority of use cases handily, and in other cases you just need to materialize it before some of those destructive transforms.

NOTE: If your data source sample was created before this feature, you'll need to generate a new sample in order to see it in the interface (instead of just NULL values).

Full notes for these metadata fields are available here: https://cloud.google.com/dataprep/docs/html/Source-Metadata-References_136155148

justbeez
  • 1,367
  • 7
  • 12