I have two DataFrames. The first is books1, with this schema:

root
|-- asin: string (nullable = true)
|-- helpful: array (nullable = true)
|    |-- element: long (containsNull = true)
|-- overall: double (nullable = true)
|-- reviewText: string (nullable = true)
|-- reviewTime: string (nullable = true)
|-- reviewerID: string (nullable = true)
|-- reviewerName: string (nullable = true)
|-- summary: string (nullable = true)
|-- unixReviewTime: long (nullable = true) 

The other is label, with this schema:

root
 |-- value: integer (nullable = false)

[Image: contents of books1 and label]

But when I join them with the join command,

var bookdf = books1.join(label)

the output is not correct:

[Image: output of the join]

The value field should contain 2, 6, 0, but it contains only 2. Why is this happening? The number of rows in both DataFrames is the same.

ernest_k

1 Answer

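Calling join without a join expression does not match rows by position: with no condition, every row of books1 can be paired with every row of label (a Cartesian product), so you need to provide a join expression.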
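To see why a join without a condition cannot line rows up by position, here is a minimal sketch using plain Scala collections rather than Spark. The object name and the sample values are made up for illustration. With no condition, every element of one side is paired with every element of the other, whereas pairing on an explicit index gives exactly one label per book.

```scala
object JoinSketch {
  // Hypothetical stand-ins for the rows of books1 and label
  val books: Seq[String] = Seq("bookA", "bookB", "bookC")
  val labels: Seq[Int] = Seq(2, 6, 0)

  // No join condition: every book is paired with every label (3 x 3 pairs)
  val cartesian: Seq[(String, Int)] = for (b <- books; l <- labels) yield (b, l)

  // Positional pairing: give each label an index, then look it up per book
  val labelByIndex: Map[Int, Int] =
    labels.zipWithIndex.map { case (l, i) => i -> l }.toMap
  val positional: Seq[(String, Int)] =
    books.zipWithIndex.map { case (b, i) => (b, labelByIndex(i)) }

  def main(args: Array[String]): Unit = {
    println(cartesian.size) // 9
    println(positional)     // List((bookA,2), (bookB,6), (bookC,0))
  }
}
```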

If both DataFrames have the same number of rows, you can add a new index column holding the row number to each of them and then join on it. For books1:

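import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val newBookDF = spark.sqlContext.createDataFrame(
  // Append each row's position (0-based) as an extra field
  books1.rdd.zipWithIndex.map {
    case (row, index) => Row.fromSeq(row.toSeq :+ index)
  },
  // Create schema for index column
  StructType(books1.schema.fields :+ StructField("index", LongType, false))
)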

Do the same for the label DataFrame:

val newLabelDF = spark.sqlContext.createDataFrame(
  label.rdd.zipWithIndex.map {
    case (row, index) => Row.fromSeq(row.toSeq :+ index)
  },
  // Create schema for index column
  StructType(label.schema.fields :+ StructField("index", LongType, false))
)

Now you can join the two new DataFrames on the index column:

newBookDF.join(newLabelDF, Seq("index")).drop("index")

This will give you the result you expected.
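Since the same indexing step is applied to both DataFrames, it could be factored into a small helper. The following is only a sketch under assumptions: `withIndex` and `IndexJoinSketch` are hypothetical names, the sample data is made up to stand in for books1 and label, and a local SparkSession is created just so the example is self-contained. It also assumes both DataFrames already hold their rows in the order you want to pair them, because `zipWithIndex` numbers rows by partition order.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object IndexJoinSketch {
  // Hypothetical helper: appends a 0-based row-position column to a DataFrame
  def withIndex(df: DataFrame, name: String = "index"): DataFrame =
    df.sparkSession.createDataFrame(
      df.rdd.zipWithIndex.map { case (row, i) => Row.fromSeq(row.toSeq :+ i) },
      StructType(df.schema.fields :+ StructField(name, LongType, nullable = false))
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("index-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data standing in for books1 and label
    val books1 = Seq("A1", "B2", "C3").toDF("asin")
    val label = Seq(2, 6, 0).toDF("value")

    // Join on the shared index, restore the original order, then drop the index
    val joined = withIndex(books1)
      .join(withIndex(label), Seq("index"))
      .orderBy("index")
      .drop("index")

    joined.show()
    spark.stop()
  }
}
```

The explicit `orderBy("index")` is there because a join does not guarantee that the output keeps the input row order.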

koiralo