
I found how to write multiple outputs based on the key value in Spark, but I need to do the same in append mode.

I am using saveAsHadoopFile() to write the files, but my generateFileNameForKeyValue() method returns a filename derived from the key, and many keys map to the same filename, so those files keep getting overwritten.

Is there a way to write to those files in append mode?

One alternative is to use groupByKey() before saveAsHadoopFile(), but I don't want to use groupByKey() as it involves too much shuffling.

My code snippet:

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class OutputFormat extends MultipleTextOutputFormat<String, CensusData2> {

    public OutputFormat() {
    }

    @Override
    protected String generateActualKey(String key, CensusData2 value) {
        // Returning null drops the key from the output; only the value is written
        return null;
    }

    @Override
    protected String generateFileNameForKeyValue(String key, CensusData2 value, String name) {
        // Derive the file name from the sanitized key; different keys that
        // sanitize to the same string end up targeting the same file
        String fileName = key.replaceAll(" ", "").replaceAll("[^a-zA-Z0-9_-]", "") + "." + name + ".out";
        return fileName;
    }
}
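
For context, this output format would be plugged in via saveAsHadoopFile(). A minimal usage sketch (the rdd variable and the output path are placeholders, not part of the original code):

// Assuming rdd is a JavaPairRDD<String, CensusData2> built earlier;
// "/output/census" is a placeholder output path
rdd.saveAsHadoopFile("/output/census", String.class, CensusData2.class, OutputFormat.class);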
  • I found the way to do this: by overriding the `protected RecordWriter getBaseRecordWriter(FileSystem fs, JobConf job, String name, Progressable arg3)` method in my _OutputFormat_ class (see the sketch after these comments) – Harshal Zope Oct 17 '16 at 05:07
  • Also, the answer in the [link](http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job) was not satisfying – Harshal Zope Oct 17 '16 at 05:13
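
Following up on the comment above, here is a minimal sketch of what that override might look like, merged into a MultipleTextOutputFormat subclass. This is an assumption about the fix, not the commenter's exact code: the AppendingOutputFormat name, the newline-terminated value.toString() output, and the append-or-create logic are illustrative, and FileSystem.append() only works on file systems that support appends (e.g. HDFS with append enabled):

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;
import org.apache.hadoop.util.Progressable;

public class AppendingOutputFormat extends MultipleTextOutputFormat<String, CensusData2> {

    @Override
    protected RecordWriter<String, CensusData2> getBaseRecordWriter(
            FileSystem ignored, JobConf job, String name, Progressable progress) throws IOException {

        // Resolve the target file inside the job's output directory
        Path file = new Path(FileOutputFormat.getOutputPath(job), name);
        FileSystem fs = file.getFileSystem(job);

        // Append when the file already exists, create it otherwise;
        // fs.append() requires a file system that supports appends
        final FSDataOutputStream out =
                fs.exists(file) ? fs.append(file) : fs.create(file, progress);

        return new RecordWriter<String, CensusData2>() {
            @Override
            public void write(String key, CensusData2 value) throws IOException {
                // generateActualKey() returns null, so write only the value
                out.writeBytes(value.toString());
                out.writeBytes("\n");
            }

            @Override
            public void close(Reporter reporter) throws IOException {
                out.close();
            }
        };
    }
}

The generateActualKey() and generateFileNameForKeyValue() overrides from the snippet above would carry over unchanged; only the writer creation changes, so records targeting a filename that already exists are appended rather than rewritten.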

0 Answers