I have an archive _2016_08_17.zip that contains 8 .tsv files, and I need to extract the one named hit_data.tsv and upload it to BigQuery. The archive sits in a bucket on Google Cloud Platform.

Can someone give me a simple program that opens the archive, finds the correct file, and prints its rows to the screen? I can take it from there. My idea is to replace the path gs://path_name/*hit_data.tsv with a buffer containing the hit_data.tsv data. Here is my current pipeline:
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.transforms.Sample;

// DataflowUtils, ExtractRows, BigQuery.getTableReference, and getSchema()
// are helpers defined elsewhere in my project.
public static void main(String[] args) {
    Pipeline p = DataflowUtils.createFromArgs(args);
    p
        // This is the path I want to replace with the extracted file's contents.
        .apply(TextIO.Read.from("gs://path_name/*hit_data.tsv"))
        .apply(Sample.<String>any(10))
        .apply(ParDo.named("ExtractRows").of(new ExtractRows('\t', "InformationDateID")))
        .apply(BigQueryIO.Write
            .named("BQWrite")
            .to(BigQuery.getTableReference("ddm_now_apps", true))
            .withSchema(getSchema())
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
    p.run();
}
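
For reference, this is roughly what I have in mind for the reading step. It is an untested sketch that assumes the google-cloud-storage client library plus java.util.zip; the bucket name, object name, and class name are placeholders based on the path above.

import com.google.cloud.storage.Blob;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ZipEntryPrinter {
    public static void main(String[] args) throws Exception {
        Storage storage = StorageOptions.getDefaultInstance().getService();
        // Placeholder bucket and object names; replace with the real ones.
        Blob blob = storage.get("path_name", "_2016_08_17.zip");

        // Stream the archive straight from GCS instead of downloading it first.
        try (ZipInputStream zis = new ZipInputStream(
                Channels.newInputStream(blob.reader()))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (!entry.getName().endsWith("hit_data.tsv")) {
                    continue; // skip the other seven .tsv entries
                }
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(zis, StandardCharsets.UTF_8));
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line); // print each row of hit_data.tsv
                }
                break; // done once the file is found
            }
        }
    }
}

Since TextIO.Read.from only accepts a file pattern, I suspect I will either have to write the extracted entry back to GCS as a plain .tsv and point the pipeline at that, or feed the buffered lines into the pipeline with Create.of(...). If someone can confirm whether this is a sane approach, I can take it from there.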