-3

I have an archive, _2016_08_17.zip, that contains 8 .tsv files. I need to extract the file named hit_data.tsv and upload it to BigQuery. The files are in a bucket on Google Cloud Platform.

Can someone give me a simple program that opens the archive, finds the correct file, and prints its rows to the screen? I can take it from there. My idea is to replace the path gs://path_name/*hit_data.tsv with the buffer that contains the hit_data.tsv data.

    public static void main(String[] args) {
        Pipeline p = DataflowUtils.createFromArgs(args);

        p
            // Read every file in the bucket whose name ends with hit_data.tsv.
            .apply(TextIO.Read.from("gs://path_name/*hit_data.tsv"))
            // Optional: sample 10 rows while testing.
            // .apply(Sample.<String>any(10))
            // Split each tab-separated line into a row.
            .apply(ParDo.named("ExtractRows").of(new ExtractRows('\t', "InformationDateID")))
            // Append the rows to the BigQuery table, creating it if needed.
            .apply(BigQueryIO.Write
                    .named("BQWrite")
                    .to(BigQuery.getTableReference("ddm_now_apps", true))
                    .withSchema(getSchema())
                    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));

        p.run();
    }

Daniel Lee

2 Answers

0

By definition, you can't read a file from a zip archive without unzipping it.

GreyBeardedGeek

0

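Java has the ZipFile class (java.util.zip.ZipFile). Its entries() method returns an enumeration of the entries in the archive, which we can search for the one we want. If we already know the name and path of the file inside the zip, we can call getEntry() directly instead.

Then, as the last step, we can call getInputStream() to read only the entry we want, without extracting the other files.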
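
The steps above could look roughly like the sketch below. It assumes the archive has already been copied from the bucket to the local file system (ZipFile works on local files, not gs:// paths) and that the file is UTF-8 text. The class name ZipEntryPrinter and the helper readEntryLines are made up for this illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZipEntryPrinter {

    // Hypothetical helper: returns the lines of the first entry whose name
    // ends with the given suffix (e.g. "hit_data.tsv").
    public static List<String> readEntryLines(String zipPath, String suffix) throws IOException {
        List<String> lines = new ArrayList<>();
        try (ZipFile zip = new ZipFile(zipPath)) {
            // Walk the archive's entries until one matches the suffix.
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory() || !entry.getName().endsWith(suffix)) {
                    continue;
                }
                // Only this entry is decompressed; the others are never read.
                // UTF-8 is an assumption about the file's encoding.
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(zip.getInputStream(entry), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        lines.add(line);
                    }
                }
                return lines;
            }
        }
        throw new IOException("No entry ending with " + suffix + " in " + zipPath);
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the local path to the archive, e.g. _2016_08_17.zip
        for (String row : readEntryLines(args[0], "hit_data.tsv")) {
            System.out.println(row);
        }
    }
}
```

Matching on the end of the name is used here because the exact path of hit_data.tsv inside the archive is not given; if it is known, zip.getEntry(name) avoids the loop. For a large file it would probably be better to process each line as it is read rather than collect them all in a list.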

Koziołek