I see that Hive-to-Hive data movement in Gobblin has a look back configuration that lets us specify the partition dates from which we want to copy, using

gobblin.data.management.copy.hive.filter.LookbackPartitionFilterGenerator

Is there a similar look back configuration for HDFS to GCS (Google Cloud Storage) data copy in Gobblin that copies only the files after a particular partition date?

My files in HDFS are partitioned by date.

1 Answer

If you are looking to copy time-partitioned HDFS files to GCS, you can use the TimeAwareCopyableGlobDatasetFinder. This dataset finder instantiates a TimeAwareRecursiveCopyableDataset, which accepts a config that sets a look back time, specified as a number of days, hours, or minutes. The underlying distcp job will copy all the partitions of the dataset up to the specified look back time.

As an example, if you are interested in copying all the hourly partitions of a dataset for the last 2 days, your Gobblin distcp job would include the following configs:

gobblin.dataset.profile.class="org.apache.gobblin.data.management.copy.TimeAwareCopyableGlobDatasetFinder"
gobblin.dataset.pattern=/root/dataset/path
gobblin.copy.recursive.date.pattern=yyyy-MM-dd-HH
gobblin.copy.recursive.lookback.time=2d
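
To get a feel for what a 2-day look back over hourly partitions covers, here is a rough Python sketch that lists the partition directories falling inside the window. It is not Gobblin code: the function name `partitions_in_lookback` and the inclusive handling of the window boundaries are my assumptions for illustration, so check the actual boundary behaviour against your Gobblin version. The `%Y-%m-%d-%H` format is the Python equivalent of the `yyyy-MM-dd-HH` pattern above.

```python
from datetime import datetime, timedelta


def partitions_in_lookback(root, now, lookback, fmt="%Y-%m-%d-%H"):
    """Hypothetical helper: list hourly partition paths from (now - lookback) to now.

    Both ends of the window are treated as inclusive; that is an assumption,
    not a statement about how Gobblin handles the boundaries.
    """
    # Truncate the start of the window to the top of the hour.
    current = (now - lookback).replace(minute=0, second=0, microsecond=0)
    paths = []
    while current <= now:
        paths.append(f"{root}/{current.strftime(fmt)}")
        current += timedelta(hours=1)
    return paths


# Example: a job running at 2020-01-03 12:30 with a 2-day look back.
paths = partitions_in_lookback(
    "/root/dataset/path",
    now=datetime(2020, 1, 3, 12, 30),
    lookback=timedelta(days=2),
)
```

With these inputs the list starts at `/root/dataset/path/2020-01-01-12` and ends at `/root/dataset/path/2020-01-03-12`; anything older is outside the window and would be skipped.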
Dharman
sv2000