
I need to implement a Java Spark program that counts the tuples sharing the same column value at a given index. The command line parameters are [input path] [column index] [output path]. The input is a TSV file with the format: Registration (matriculation number, last name, first name, lecture, semester).

1234    Graph   Polly   Big Data    WiSe15
5678    Conda   Anna    Big Data    WiSe16
9012    Jeego   Hugh    Big Data    WiSe16
1234    Graph   Polly   Data Mining WiSe16
3456    Downe   Sid     Data Mining WiSe16
I already started implementing the setup with the configuration but got lost on the rest of the assignment.
package bigdata;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RelCount {

    public static void main(String[] args) {
        // Command line parameters: [input path] [column index] [output path]
        SparkConf conf = new SparkConf().setAppName("RelCount");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> allRows = sc.textFile(args[0]);
        JavaRDD<List<String>> rows = allRows.map(l -> Arrays.asList(l.split("\t")));
    }
}

The output of the program should be in this form:

(Big Data, 3)
(Data Mining, 2)

Thanks for your help :)

Jacek Laskowski

1 Answer

A TSV file is just a CSV file with a tab as the delimiter, so the easiest way is to use the DataFrame API's CSV reader to read the file. If required, the DataFrame can later be converted into an RDD.

First, get a Spark session:

SparkSession spark = SparkSession.builder()
    .master("local[*]")
    .appName("SparkTest")
    .getOrCreate();

Now the file can be read:

Dataset<Row> df = spark
    .read()
    .option("delimiter", "\t")
    .option("header", false)
    .csv(<path to file>);

As the CSV reader takes care of the parsing, there is no need to split the lines manually anymore.
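Conceptually, the grouping and counting that follows computes the same aggregation as this plain-Java sketch (the class and method names here are hypothetical, and Spark of course distributes the work across the cluster instead of looping locally):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ColumnCount {
    // Count how many rows share the same value at the given column index.
    static Map<String, Long> countByColumn(List<String> tsvLines, int index) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String line : tsvLines) {
            // the same tab split the CSV reader performs for us
            String key = line.split("\t")[index];
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
            "1234\tGraph\tPolly\tBig Data\tWiSe15",
            "5678\tConda\tAnna\tBig Data\tWiSe16",
            "9012\tJeego\tHugh\tBig Data\tWiSe16",
            "1234\tGraph\tPolly\tData Mining\tWiSe16",
            "3456\tDowne\tSid\tData Mining\tWiSe16");
        System.out.println(countByColumn(rows, 3)); // {Big Data=3, Data Mining=2}
    }
}
```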

In the next step, the column name is extracted. As the reader option header has been set to false, the column names are generic names like _c0, _c1, ... In this example, we group by the fourth column (index 3, 0-based), so we select that column's name.

int index = 3;
String columnname = df.schema().fieldNames()[index];
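In the assignment the index comes from the command line (args[1]), so it arrives as a string and should be parsed and checked against the number of columns before indexing into the schema. A small sketch of that step (parseIndex is a hypothetical helper name):

```java
public class IndexArg {
    // Parse the column index argument and check it against the column count.
    static int parseIndex(String arg, int columnCount) {
        int index = Integer.parseInt(arg);
        if (index < 0 || index >= columnCount) {
            throw new IllegalArgumentException("column index out of range: " + index);
        }
        return index;
    }

    public static void main(String[] args) {
        // a registration row has 5 columns, so index 3 is valid
        System.out.println(parseIndex("3", 5)); // 3
    }
}
```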

As a last step, we group the dataframe by the selected column and count the number of lines per group:

df.groupBy(columnname)
    .count()
    .show();

The output is:

+-----------+-----+
|        _c3|count|
+-----------+-----+
|Data Mining|    2|
|   Big Data|    3|
+-----------+-----+

If required, the result can also be transformed into an RDD:

JavaRDD<Row> rdd = df.groupBy(columnname)
    .count()
    .toJavaRDD();

But usually the DataFrame API is much more convenient than the RDD API.
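The assignment also wants the result written to an output path in the form (Big Data, 3). One way is to map each result Row to a formatted string and save that. The formatRow helper below is a hypothetical name; the commented lines sketch how it would plug into the JavaRDD<Row> from above (those need a running Spark session, so they are untested here):

```java
public class TupleFormat {
    // Format a (value, count) pair the way the assignment expects, e.g. "(Big Data, 3)".
    static String formatRow(String value, long count) {
        return "(" + value + ", " + count + ")";
    }

    // With the JavaRDD<Row> from the answer, this would roughly become:
    //   rdd.map(row -> formatRow(row.getString(0), row.getLong(1)))
    //      .saveAsTextFile(outputPath);   // outputPath = args[2]

    public static void main(String[] args) {
        System.out.println(formatRow("Big Data", 3)); // (Big Data, 3)
    }
}
```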

werner