I am trying to run the following code:

        SparkSession spark = SparkSession
                .builder()
                .appName("test")
                .master("local")
//                .enableHiveSupport()
                .getOrCreate();
        List<String> list=new ArrayList<String>();
        list.add("HI");
        list.add("HI");
        list.add("HI");
        Dataset<Row> dataDs = spark.createDataset(list, Encoders.STRING()).toDF();
        List<String> list2=new ArrayList<String>();
        list2.add("1");
        list2.add("2");
        list2.add("3");
        Dataset<Row> dataDs2 = spark.createDataset(list2, Encoders.STRING()).toDF().withColumnRenamed("value","newvalue");
        Column col=dataDs2.col("newvalue");
        dataDs=dataDs.withColumn("newcol",col);
        dataDs.show();

However, it fails with the following error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved attribute(s) newvalue#10 missing from value#1 in operator !Project [value#1, newvalue#10 AS newcol#13];; !Project [value#1, newvalue#10 AS newcol#13]

When I searched for this error online, the usual explanation was duplicate column names. However, my column names are different: dataDs has a column named 'value', while dataDs2 has a column named 'newvalue'. So I do not understand why the error still occurs. Can someone help me out?

Anshul Dubey

1 Answer

The problem is here:

Column col=dataDs2.col("newvalue");
dataDs=dataDs.withColumn("newcol",col);

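Your col is a column of dataDs2, so you cannot use it in dataDs. withColumn() can only reference columns that belong to the Dataset it is called on, which is why the analyzer reports newvalue#10 as missing from value#1.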

It looks like you want to zip the two DataFrames. There is an RDD.zip() function for that. For more approaches, see: How to zip two (or more) DataFrame in Spark
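
A rough sketch of the RDD.zip() route in Java, using the data from the question. The class name ZipExample and the helper zipColumns are made up for illustration. It assumes both Datasets have a single string column, and that they have the same number of partitions and the same number of rows in each partition, which RDD.zip() requires:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class ZipExample {

    // Hypothetical helper: pairs the rows of two single-string-column
    // Datasets by position and returns them as columns "value" and "newcol".
    static Dataset<Row> zipColumns(SparkSession spark, Dataset<Row> left, Dataset<Row> right) {
        // zip() pairs the i-th row of left with the i-th row of right.
        // It fails at runtime if the partition layout of the two RDDs differs.
        JavaRDD<Row> zipped = left.javaRDD()
                .zip(right.javaRDD())
                .map(pair -> RowFactory.create(pair._1().getString(0), pair._2().getString(0)));

        // Schema of the combined rows: both columns are strings.
        StructType schema = new StructType()
                .add("value", DataTypes.StringType)
                .add("newcol", DataTypes.StringType);

        return spark.createDataFrame(zipped, schema);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("test")
                .master("local")
                .getOrCreate();

        List<String> list = Arrays.asList("HI", "HI", "HI");
        List<String> list2 = Arrays.asList("1", "2", "3");

        Dataset<Row> dataDs = spark.createDataset(list, Encoders.STRING()).toDF();
        Dataset<Row> dataDs2 = spark.createDataset(list2, Encoders.STRING()).toDF();

        zipColumns(spark, dataDs, dataDs2).show();

        spark.stop();
    }
}
```

If the two Datasets come from different sources, their partition layouts will usually not match. In that case, adding an explicit index column to each side and joining on it is the safer option.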

Artem Aliev