Here is what we implemented for our use case in Java: writing to different files, with the file prefix chosen based on the content of each Avro record, using AvroMultipleOutputs.
Here is a wrapper on top of OutputFormat that produces multiple outputs using AvroMultipleOutputs, similar to what @Ram has mentioned: https://github.com/architch/MultipleAvroOutputsFormat/blob/master/MultipleAvroOutputsFormat.java
It can be used to write Avro records to multiple paths from Spark as follows:
Job job = Job.getInstance(hadoopConf);
AvroJob.setOutputKeySchema(job, schema);
AvroMultipleOutputs.addNamedOutput(job, "type1", AvroKeyOutputFormat.class, schema);
AvroMultipleOutputs.addNamedOutput(job, "type2", AvroKeyOutputFormat.class, schema);

rdd.mapToPair(event -> {
    if (event.isType1())
        return new Tuple2<>(new Tuple2<>("type1", new AvroKey<>(event.getRecord())), NullWritable.get());
    else
        return new Tuple2<>(new Tuple2<>("type2", new AvroKey<>(event.getRecord())), NullWritable.get());
})
.saveAsNewAPIHadoopFile(
    outputBasePath,
    GenericData.Record.class,
    NullWritable.class,
    MultipleAvroOutputsFormat.class,
    job.getConfiguration()
);
Here getRecord() returns a GenericRecord.
The output at outputBasePath then looks like this:
17359 May 28 15:23 type1-r-00000.avro
28029 May 28 15:24 type1-r-00001.avro
16473 May 28 15:24 type1-r-00003.avro
17124 May 28 15:23 type2-r-00000.avro
30962 May 28 15:24 type2-r-00001.avro
16229 May 28 15:24 type2-r-00003.avro
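In case the link goes stale, here is a minimal sketch of the shape such a wrapper can take. The class shape matches the usage above, but createAvroMultipleOutputs is a hypothetical placeholder for the context-adapting plumbing that the linked implementation actually contains; consult the repo for the real code.

```java
import java.io.IOException;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroMultipleOutputs;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import scala.Tuple2;

// Sketch of the wrapper's shape: every record is delegated to
// AvroMultipleOutputs, with the first tuple element selecting the
// named output registered via addNamedOutput ("type1", "type2", ...).
public class MultipleAvroOutputsFormat<T>
        extends FileOutputFormat<Tuple2<String, AvroKey<T>>, NullWritable> {

    @Override
    public RecordWriter<Tuple2<String, AvroKey<T>>, NullWritable> getRecordWriter(
            TaskAttemptContext context) throws IOException, InterruptedException {
        // AvroMultipleOutputs expects a TaskInputOutputContext; the linked
        // implementation adapts the TaskAttemptContext it receives here.
        AvroMultipleOutputs amos = createAvroMultipleOutputs(context);

        return new RecordWriter<Tuple2<String, AvroKey<T>>, NullWritable>() {
            @Override
            public void write(Tuple2<String, AvroKey<T>> key, NullWritable value)
                    throws IOException, InterruptedException {
                // First tuple element names the output; second is the AvroKey.
                amos.write(key._1(), key._2(), value);
            }

            @Override
            public void close(TaskAttemptContext ctx)
                    throws IOException, InterruptedException {
                amos.close(); // flush and close all named-output writers
            }
        };
    }

    // Hypothetical placeholder: see the linked repo for the actual adapter.
    private AvroMultipleOutputs createAvroMultipleOutputs(TaskAttemptContext context) {
        throw new UnsupportedOperationException("see linked implementation");
    }
}
```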
This can also be used to write to entirely different directories by providing the baseOutputPath directly, as mentioned here: write to multiple directory
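As a sketch of that variant (assuming the wrapper forwards the tuple's first element to the baseOutputPath argument of AvroMultipleOutputs.write; check the linked source for how it actually dispatches), a path separator in the key routes files into subdirectories instead of filename prefixes:

```java
// Sketch: "type1/part" as a baseOutputPath places files under
// outputBasePath/type1/, "type2/part" under outputBasePath/type2/.
rdd.mapToPair(event -> {
    String baseOutputPath = event.isType1() ? "type1/part" : "type2/part";
    return new Tuple2<>(
        new Tuple2<>(baseOutputPath, new AvroKey<>(event.getRecord())),
        NullWritable.get());
})
.saveAsNewAPIHadoopFile(
    outputBasePath,
    GenericData.Record.class,
    NullWritable.class,
    MultipleAvroOutputsFormat.class,
    job.getConfiguration()
);
```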