Loading in the data:
SparkConf conf = new SparkConf().setAppName("TEST").setMaster("local[*]");
JavaSparkContext jsc = new JavaSparkContext(conf);
JavaRDD<String> stringRDDVotes = jsc.textFile("HarryPotter.csv");
I currently have this table loaded into an RDD:
ID | A | B | Name |
---|---|---|---|
1 | 23 | 50 | Harry;Potter |
I want to convert it to the table below:
ID | A | B | Name |
---|---|---|---|
1 | 23 | 50 | Harry |
1 | 23 | 50 | Potter |
All the solutions I found use Spark SQL, which I can't use, so how would I get this result using only RDD operations like flatMap and mapToPair?
Something like this maybe?
flatMap(s -> Arrays.asList(s.split(";")).iterator())
The code above produces this:
ID | A | B | Name |
---|---|---|---|
1 | 23 | 50 | Harry |
Potter |
I know that in Scala it can be done like this, but I don't know how to do it with Java:
val input: RDD[String] = sc.parallelize(Seq("1,23,50,Harry;Potter"))
val csv: RDD[Array[String]] = input.map(_.split(','))
val result = csv.flatMap { case Array(s1, s2, s3, s4) => s4.split(";").map(part => (s1, s2, s3, part)) }
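My rough attempt at translating that Scala snippet to Java is the sketch below. It's untested; it assumes Spark 2.x, where flatMap expects an Iterator (which matches the .iterator() call in my attempt above), and it uses scala.Tuple4 for the output rows:

import java.util.Arrays;
import scala.Tuple4;

// untested sketch: split each CSV line into columns, then expand the Name column on ';'
JavaRDD<Tuple4<String, String, String, String>> result = stringRDDVotes
        .map(line -> line.split(","))                       // ["1", "23", "50", "Harry;Potter"]
        .flatMap(cols -> Arrays.stream(cols[3].split(";"))  // one output row per name
                .map(name -> new Tuple4<>(cols[0], cols[1], cols[2], name))
                .iterator());

Is this roughly the right direction, or is there a more idiomatic way, for example using mapToPair to key the rows by ID?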