I have a directory containing 99 files. I want to read each file and compute its SHA-256 checksum, and eventually write the results to a JSON file as key-value pairs, for example (File 1, 092180x0123). Currently I am having trouble passing a readable file to my ParDo function; I must be missing something very simple. This is my first time using Apache Beam, so any help would be appreciated. Here is what I have so far:
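import java.io.IOException;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

public class BeamPipeline {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.create();
        Pipeline p = Pipeline.create(options);
        p
            .apply("Match Files", FileIO.match().filepattern("../testdata/input-*"))
            .apply("Read Files", FileIO.readMatches())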
            .apply("Hash File", ParDo.of(new DoFn<FileIO.ReadableFile, KV<FileIO.ReadableFile, String>>() {
                @ProcessElement
                public void processElement(@Element FileIO.ReadableFile file,
                        OutputReceiver<KV<FileIO.ReadableFile, String>> out)
                        throws NoSuchAlgorithmException, IOException {
                    // File -> Bytes: read the file's contents.
                    // (file.toString() only describes the file; it does not return its data.)
                    byte[] byteFile = file.readFullyAsBytes();
                    // SHA-256
                    MessageDigest md = MessageDigest.getInstance("SHA-256");
                    byte[] messageDigest = md.digest(byteFile);
                    BigInteger no = new BigInteger(1, messageDigest);
                    String hashtext = no.toString(16);
                    // Left-pad to 64 hex characters (256 bits, 4 bits per hex digit).
                    while (hashtext.length() < 64) {
                        hashtext = "0" + hashtext;
                    }
                    out.output(KV.of(file, hashtext));
                }
            }))
            .apply(FileIO.write());
        p.run();
    }
}
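
To separate the hashing question from the Beam question, here is a minimal plain-Java sketch of the checksum step on its own. The class name `Sha256Hex` and the method `sha256Hex` are hypothetical names chosen for illustration; they are not part of Beam. Inside a DoFn, the byte array would presumably come from `FileIO.ReadableFile.readFullyAsBytes()` rather than from a string literal.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only; not a Beam class.
public class Sha256Hex {

    /** Returns the SHA-256 digest of the given bytes as a 64-character lowercase hex string. */
    public static String sha256Hex(byte[] content) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(content);

        // Format each byte as exactly two hex digits so leading zeros are kept.
        StringBuilder hex = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // Stand-in for file contents; in the pipeline these bytes would be read from the file.
        byte[] content = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println(sha256Hex(content));
    }
}
```

Formatting byte by byte avoids the manual zero-padding loop, since every byte always contributes two characters to the result.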