I need to implement a Java Spark program that counts the tuples sharing the same value in the column at a given index. The command-line parameters are [input path] [column index] [output path]. The input is a TSV file in the format Registration (matriculation number, last name, first name, lecture, semester):
1234 Graph Polly Big Data WiSe15
5678 Conda Anna Big Data WiSe16
9012 Jeego Hugh Big Data WiSe16
1234 Graph Polly Data Mining WiSe16
3456 Downe Sid Data Mining WiSe16
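package bigdata;

import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RelCount {
    public static void main(String[] args) {
        // args: [input path] [column index] [output path]
        SparkConf conf = new SparkConf().setAppName("RelCount");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Read the TSV file line by line and split each line into its columns.
        JavaRDD<String> allRows = sc.textFile(args[0]);
        JavaRDD<List<String>> line = allRows.map(l -> Arrays.asList(l.split("\t")));
        // Still missing: counting the rows per value of the column at the given index.
        sc.close();
    }
}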
The output of the program should be in this form:
(Big Data, 3)
(Data Mining, 2)
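To make the expected result concrete, here is a minimal sketch of just the counting step in plain Java, without Spark. It assumes the column index is zero-based (so the lecture column is index 3) and that rows with too few columns can be skipped; the class name ColumnCountSketch and the method countByColumn are made up for illustration. In Spark the same idea would presumably translate to mapToPair followed by reduceByKey on a JavaPairRDD, but that part is not shown here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, only for illustrating the counting logic.
class ColumnCountSketch {

    // Counts how many lines carry each distinct value in the given
    // tab-separated column (assumption: columnIndex is zero-based).
    static Map<String, Long> countByColumn(List<String> lines, int columnIndex) {
        // LinkedHashMap keeps the order in which values first appear.
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (columnIndex < 0 || columnIndex >= fields.length) {
                continue; // assumption: silently skip rows that are too short
            }
            // Add 1 to the running count for this column value.
            counts.merge(fields[columnIndex], 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The sample registrations from above, with tabs between the columns.
        List<String> lines = Arrays.asList(
            "1234\tGraph\tPolly\tBig Data\tWiSe15",
            "5678\tConda\tAnna\tBig Data\tWiSe16",
            "9012\tJeego\tHugh\tBig Data\tWiSe16",
            "1234\tGraph\tPolly\tData Mining\tWiSe16",
            "3456\tDowne\tSid\tData Mining\tWiSe16");

        // Prints one "(value, count)" pair per distinct lecture.
        countByColumn(lines, 3).forEach((value, count) ->
            System.out.println("(" + value + ", " + count + ")"));
    }
}
```

With index 3 this counts per lecture; with index 4 it would count per semester instead.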
Thanks for your help :)