4

I am trying to create a DataFrame from an RDD[CassandraRow], but I can't, because createDataFrame(RDD[Row], schema: StructType) needs an RDD[Row], not an RDD[CassandraRow].

  • How can I achieve this?

Also, one of the answers to the question "How to convert rdd object to dataframe in spark" suggests calling toDF() on an RDD[Row] to get a DataFrame from the RDD. That is not working for me: I tried toDF() on an RDD[Row] in another example.

  • I also don't understand how a DataFrame method (toDF()) can be called on an instance of an RDD (RDD[Row]).

I am using Scala.

Parth Vishvajit

1 Answer

6

If you really need this you can always map your data to Spark rows:

sqlContext.createDataFrame(
  rdd.map(r => org.apache.spark.sql.Row.fromSeq(r.columnValues)),
  schema
)

but if you want DataFrames it is better to import data directly:

val df = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> table, "keyspace" -> keyspace))
  .load()
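
The first snippet above takes a `schema` that is never defined, and the question also asks where `toDF()` comes from. The following is a minimal sketch of both points, not the answerer's own code. It assumes a local Spark with the `SQLContext` API used above, and it uses plain `Seq[Any]` values as a stand-in for `r.columnValues`, so no Cassandra is needed. The `User` case class, the column names, and the `RowsToDataFrame` object are hypothetical. As far as I know, `toDF()` is not a method of `RDD` at all: it becomes available through the implicit conversions imported from `sqlContext.implicits._`, and those apply to RDDs of case classes or tuples, not to `RDD[Row]`, which is why `createDataFrame` with an explicit schema is used for rows.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical record type, used only for this illustration.
case class User(id: Int, name: String)

object RowsToDataFrame {
  // Assumed schema: one field per column, in the same order as the row values.
  val schema: StructType = StructType(Seq(
    StructField("id", IntegerType, nullable = false),
    StructField("name", StringType, nullable = true)
  ))

  // Wraps each sequence of column values in a Spark Row and applies the schema.
  def toDataFrame(sqlContext: SQLContext, values: RDD[Seq[Any]]): DataFrame =
    sqlContext.createDataFrame(values.map(v => Row.fromSeq(v)), schema)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rows-to-df").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // This import is what makes toDF() available on RDDs of case classes and tuples.
    import sqlContext.implicits._

    // Stand-in for rdd.map(r => r.columnValues) on an RDD[CassandraRow].
    val values: RDD[Seq[Any]] = sc.parallelize(Seq(Seq[Any](1, "alice"), Seq[Any](2, "bob")))
    toDataFrame(sqlContext, values).show()

    // toDF() on an RDD of a case class: the schema is derived from the fields of User.
    val fromCaseClass = sc.parallelize(Seq(User(1, "alice"), User(2, "bob"))).toDF()
    fromCaseClass.printSchema()

    sc.stop()
  }
}
```

The field types in the schema have to match the runtime types of the values in each row; a mismatch typically surfaces only when the DataFrame is evaluated.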
zero323
  • 1
    You could also read in your table as `sc.cassandraTable[SomeCaseClass]` but the direct approach above is best :) – RussS Feb 01 '16 at 16:48
  • @zero323 Thanks, your answer (code) works correctly. Yes, I know that a DataFrame can be obtained directly as you showed. But in my case (the picture in the question was just a demo), I may have quite a large table on the DB side and I want only a few rows from it. As I understand it, we have two scenarios to consider: either fetch the rows into an RDD, or first make a DataFrame and then fetch those rows from it. I chose the RDD way because (as far as I know) the queries are fired on the DB directly, (continued...) – Parth Vishvajit Feb 02 '16 at 05:44
  • 1
    (...continued) so the data (table) will be processed according to the operation on the DB side and only the result set will be returned. But if we use a DataFrame, it first loads the whole table into memory and then performs the query on it. So we think that if we need only a few rows from a very large table, we should use an RDD. Please correct me if I have misunderstood anything here. Thank you! – Parth Vishvajit Feb 02 '16 at 05:47
  • @RussS, how would that work? I'm not sure what `SomeCaseClass` needs to know which table to read... – Akhil Nair Sep 27 '17 at 17:51
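
On the last question: the case class does not identify the table. In the Spark Cassandra Connector, the keyspace and table names are passed as arguments to `cassandraTable`, and the type parameter only says what each row should be mapped to. The sketch below illustrates this together with the server-side filtering discussed in the comments above. It assumes the connector is on the classpath and a Cassandra node is reachable; the host, the `my_keyspace` keyspace, the `users` table, and the `User` case class are all hypothetical.

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

// Hypothetical case class; its field names are assumed to match the table's column names.
case class User(id: Int, name: String)

object CassandraCaseClassRead {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cassandra-case-class-read")
      .setMaster("local[*]")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumed host
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Keyspace and table are plain arguments; User only describes the row shape.
    // select and where are sent to Cassandra as part of the CQL query, so the
    // predicate has to be one that Cassandra itself can evaluate.
    val users: RDD[User] = sc
      .cassandraTable[User]("my_keyspace", "users")
      .select("id", "name")
      .where("id = ?", 1)

    // An RDD of a case class can be turned into a DataFrame with toDF().
    users.toDF().show()

    sc.stop()
  }
}
```

As far as I know, the DataFrame source shown in the answer can also push eligible filters down to Cassandra, so it does not necessarily load the whole table first; which filters qualify depends on the table's keys and the connector version.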