
How can I combine a DataFrame with a single column (Description) with another DataFrame having two columns (Name, Caption), so that the resulting DataFrame contains three columns (Name, Caption, Description)?

Seena V P

1 Answer


I am providing a solution in Scala. I could have added this as a comment, but because of the formatting and the image I am attaching, I am posting it as an answer. I am fairly sure there is an equivalent way to do this in Java.

import org.apache.spark.sql.functions.monotonically_increasing_id
import spark.implicits._ // for toDF; already in scope in the Spark shell

// Build the two input DataFrames
val nameCaptionDataFrame = Seq(("name1","caption1"),("name2","caption2"),("name3","caption3"),("name4","caption4")).toDF("name","caption")
val descriptionDataFrame = List("desc1","desc2","desc3","desc4").toDF("description")
// Add a generated id to each DataFrame so the rows can be joined by position
val nameCaptionDataFrameWithId = nameCaptionDataFrame.withColumn("nameId",monotonically_increasing_id())
nameCaptionDataFrameWithId.show
val descriptionDataFrameId = descriptionDataFrame.withColumn("descId",monotonically_increasing_id())
descriptionDataFrameId.show
// Join on the generated ids to stitch the three columns together
nameCaptionDataFrameWithId.join(descriptionDataFrameId, nameCaptionDataFrameWithId.col("nameId") === descriptionDataFrameId.col("descId")).show
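
One caveat worth flagging (my note, not part of the original answer): monotonically_increasing_id() produces ids that are unique and increasing but not consecutive, because the partition id is encoded in the upper bits of the value. Two independently generated id columns therefore only line up row-for-row when both DataFrames have the same partitioning, which holds for small local examples like this one but is not guaranteed on a cluster. A sketch of a position-safe alternative using RDD.zipWithIndex, with a hypothetical helper named withRowIndex, could look like this:

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Sketch (assumption, not from the original answer): attach a consecutive,
// position-based row index to a DataFrame via RDD.zipWithIndex
def withRowIndex(df: DataFrame): DataFrame = {
  val schema = StructType(df.schema.fields :+ StructField("rowId", LongType, nullable = false))
  val indexed = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
  df.sqlContext.createDataFrame(indexed, schema)
}

withRowIndex(nameCaptionDataFrame)
  .join(withRowIndex(descriptionDataFrame), "rowId")
  .drop("rowId")
  .show()

zipWithIndex assigns indexes by row position across partitions, so the two sides align regardless of how the data is split.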

Here is the sample output of this piece of code. I hope you will be able to take the idea from here (assuming the APIs are consistent) and do the same in Java.

[image: sample output of the Scala code, showing the two DataFrames with generated ids and the joined result]

** EDIT IN JAVA ** A "translation" of the code would look similar to this.

/**
 * Created by RGOVIND on 11/8/2016.
 */

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.*;
import scala.Tuple2;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SparkMain {
    static public void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("Stack Overflow App");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Build the (name, caption) pairs and the list of descriptions
        List<Tuple2<String, String>> tuples = new ArrayList<Tuple2<String, String>>();
        tuples.add(new Tuple2<String, String>("name1", "caption1"));
        tuples.add(new Tuple2<String, String>("name2", "caption2"));
        tuples.add(new Tuple2<String, String>("name3", "caption3"));

        List<String> descriptions = Arrays.asList("desc1", "desc2", "desc3");

        // Create typed Datasets from the local collections
        Encoder<Tuple2<String, String>> nameCaptionEncoder = Encoders.tuple(Encoders.STRING(), Encoders.STRING());
        Dataset<Tuple2<String, String>> nameValueDataSet = sqlContext.createDataset(tuples, nameCaptionEncoder);
        Dataset<String> descriptionDataSet = sqlContext.createDataset(descriptions, Encoders.STRING());
        // Add the same generated id column to both sides, then join on it
        Dataset<Row> nameValueDataSetWithId = nameValueDataSet.toDF("name", "caption").withColumn("id", functions.monotonically_increasing_id());
        Dataset<Row> descriptionDataSetId = descriptionDataSet.withColumn("id", functions.monotonically_increasing_id());
        nameValueDataSetWithId.join(descriptionDataSetId, "id").show();
    }
}

This prints the following. Hope this helps.

[image: sample output of the Java version, showing the joined result]
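
A small follow-up, again my note rather than part of the original answer: the join result still carries the generated id column(s), while the question asks for exactly three columns. A final select, sketched here in Scala against the variables from the first snippet, trims the output; the Java Dataset API offers the same select and drop methods:

// Keep only the three columns the question asks for
val result = nameCaptionDataFrameWithId
  .join(descriptionDataFrameId, $"nameId" === $"descId")
  .select("name", "caption", "description")
result.show()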

Ramachandran.A.G
  • Thanks for your reply, but I have an issue: "The method withColumn(String, Column) in the type Dataset is not applicable for the arguments (String, MonotonicallyIncreasingID)". The withColumn function does not accept MonotonicallyIncreasingID. – Seena V P Nov 07 '16 at 12:32
  • @SVP Added a Java sample. I am not 100% sure about performance or running this on a cluster; it works on my standalone node. – Ramachandran.A.G Nov 08 '16 at 10:16