
I'm reading data from JSON (dynamic schema) and loading it into a DataFrame.

Example Dataframe:

scala> import spark.implicits._
import spark.implicits._

scala> val DF = Seq(
     (1, "ABC"),
     (2, "DEF"),
     (3, "GHIJ")
     ).toDF("id", "word")
DF: org.apache.spark.sql.DataFrame = [id: int, word: string]

scala> DF.show
+------+-----+
|id    | word|
+------+-----+
|     1|  ABC|
|     2|  DEF|
|     3| GHIJ|
+------+-----+

Requirement: The column count and names can be anything. I want to loop over the rows and fetch each column one by one, since I need to process each value in subsequent flows. I need both the column name and the value. I'm using Scala.

Python:
for i, j in df.iterrows(): 
    print(i, j) 

I need the same functionality in Scala, where the column name and value are fetched separately.

Kindly help.

Raja

1 Answer


df.iterrows is not from PySpark, but from pandas. In Spark, you can use foreach:

DF.foreach { case Row(id: Int, word: String) => println(id, word) }

Result :

(2,DEF)
(3,GHIJ)
(1,ABC)

If you don't know the number of columns, you cannot use unapply on Row, so just do:

DF.foreach(row => println(row))

Result :

[1,ABC]
[2,DEF]
[3,GHIJ]

And operate on each row using its methods, such as getAs.
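Since the schema is dynamic, you can recover the column names from the row's own schema and pair each name with its value. A minimal sketch (assuming the `DF` from the question; requires a running Spark session):

```scala
import org.apache.spark.sql.Row

DF.foreach { row =>
  // Rows coming from a DataFrame carry their schema, so the
  // field names are available even when they aren't known up front.
  row.schema.fieldNames.foreach { name =>
    val value = row.getAs[Any](name) // typed as Any because the schema is dynamic
    println(s"$name -> $value")
  }
}
```

Equivalently, `row.schema.fieldNames.zip(row.toSeq)` gives you the (name, value) pairs as a collection if you prefer to map over them rather than print.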

Raphael Roth
  • Thanks Ram & Raphael. I'm trying to generate dynamic Put statement to load the streaming data into hbase. Any other best solution to consider apart from shc connector? – Raja Jun 04 '20 at 09:06
  • see [my answer writeHbase](https://stackoverflow.com/a/56297890/647053) may be you need to tailor according to your needs since you have dynamic columns – Ram Ghadiyaram Jun 04 '20 at 09:11
  • How i can get dynamic column name inside "getAs"? Thanks – Raja Jun 04 '20 at 09:36