
How can I combine a DataFrame with a single column (Description) with another DataFrame having two columns (Name, Caption), so that the resulting DataFrame contains three columns (Name, Caption, Description)?

Seena V P

1 Answer


I am providing a solution in Scala. I could have added this as a comment, but because of the formatting and the image I am attaching, I am posting it as an answer. I am fairly sure there is an equivalent way to do this in Java.

import org.apache.spark.sql.functions.monotonically_increasing_id
import spark.implicits._ // for toDF; already in scope in the Spark shell

// Build the two input DataFrames
val nameCaptionDataFrame = Seq(("name1","caption1"),("name2","caption2"),("name3","caption3"),("name4","caption4")).toDF("name","caption")
val descriptionDataFrame = List("desc1","desc2","desc3","desc4").toDF("description")
// Add a generated id to each DataFrame so the rows can be joined by position
val nameCaptionDataFrameWithId = nameCaptionDataFrame.withColumn("nameId",monotonically_increasing_id())
nameCaptionDataFrameWithId.show
val descriptionDataFrameId = descriptionDataFrame.withColumn("descId",monotonically_increasing_id())
descriptionDataFrameId.show
// Join on the generated ids to stitch the three columns together
nameCaptionDataFrameWithId.join(descriptionDataFrameId, nameCaptionDataFrameWithId.col("nameId") === descriptionDataFrameId.col("descId")).show
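
One caveat worth flagging (my note, not part of the original answer): monotonically_increasing_id() produces ids that are unique and increasing but not consecutive, because the partition id is encoded in the upper bits of the value. Two independently generated id columns therefore only line up row-for-row when both DataFrames have the same partitioning, which holds for small local examples like this one but is not guaranteed on a cluster. A sketch of a position-safe alternative using RDD.zipWithIndex, with a hypothetical helper named withRowIndex, could look like this:

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Sketch (assumption, not from the original answer): attach a consecutive,
// position-based row index to a DataFrame via RDD.zipWithIndex
def withRowIndex(df: DataFrame): DataFrame = {
  val schema = StructType(df.schema.fields :+ StructField("rowId", LongType, nullable = false))
  val indexed = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
  df.sqlContext.createDataFrame(indexed, schema)
}

withRowIndex(nameCaptionDataFrame)
  .join(withRowIndex(descriptionDataFrame), "rowId")
  .drop("rowId")
  .show()

zipWithIndex assigns indexes by row position across partitions, so the two sides align regardless of how the data is split.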

Here is the sample output of this piece of code. I hope you will be able to take the idea from here (assuming the APIs are consistent) and do the same in Java.

[image: sample output of the Scala code, showing the two DataFrames with generated ids and the joined result]

** EDIT IN JAVA ** A "translation" of the code would look similar to this.

/**
 * Created by RGOVIND on 11/8/2016.
 */

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.*;
import scala.Tuple2;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SparkMain {
    static public void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("Stack Overflow App");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Build the (name, caption) pairs and the list of descriptions
        List<Tuple2<String, String>> tuples = new ArrayList<Tuple2<String, String>>();
        tuples.add(new Tuple2<String, String>("name1", "caption1"));
        tuples.add(new Tuple2<String, String>("name2", "caption2"));
        tuples.add(new Tuple2<String, String>("name3", "caption3"));

        List<String> descriptions = Arrays.asList("desc1", "desc2", "desc3");

        // Create typed Datasets from the local collections
        Encoder<Tuple2<String, String>> nameCaptionEncoder = Encoders.tuple(Encoders.STRING(), Encoders.STRING());
        Dataset<Tuple2<String, String>> nameValueDataSet = sqlContext.createDataset(tuples, nameCaptionEncoder);
        Dataset<String> descriptionDataSet = sqlContext.createDataset(descriptions, Encoders.STRING());
        // Add the same generated id column to both sides, then join on it
        Dataset<Row> nameValueDataSetWithId = nameValueDataSet.toDF("name", "caption").withColumn("id", functions.monotonically_increasing_id());
        Dataset<Row> descriptionDataSetId = descriptionDataSet.withColumn("id", functions.monotonically_increasing_id());
        nameValueDataSetWithId.join(descriptionDataSetId, "id").show();
    }
}

This prints the following. Hope this helps.

[image: sample output of the Java version, showing the joined result]
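
A small follow-up, again my note rather than part of the original answer: the join result still carries the generated id column(s), while the question asks for exactly three columns. A final select, sketched here in Scala against the variables from the first snippet, trims the output; the Java Dataset API offers the same select and drop methods:

// Keep only the three columns the question asks for
val result = nameCaptionDataFrameWithId
  .join(descriptionDataFrameId, $"nameId" === $"descId")
  .select("name", "caption", "description")
result.show()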

Ramachandran.A.G
  • Thanks for your reply, but I have an issue: "The method withColumn(String, Column) in the type Dataset is not applicable for the arguments (String, MonotonicallyIncreasingID)". The withColumn function does not accept MonotonicallyIncreasingID. – Seena V P Nov 07 '16 at 12:32
  • @SVP Added a Java sample. I am not 100% sure about performance or running this on a cluster; it works on my standalone node. – Ramachandran.A.G Nov 08 '16 at 10:16