
I have a list with more than 30 strings. How do I convert the list into a DataFrame? What I tried:

e.g.

val list = List("a","b","v","b").toDS().toDF()

Output :


+-------+
|  value|
+-------+
|a      |
|b      |
|v      |
|b      |
+-------+


Expected output is:

+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
|  a|  b|  v|  b|
+---+---+---+---+

Any help on this?

  • Is the list you are reading from a file or table? – Rajat Mishra Jan 26 '17 at 05:00
  • You could try mapping the list to a list of tuples, where _1 represents the position. I forget how to get list position, though. – Adrian M. Jan 26 '17 at 05:03
  • No. I am reading the tag values from an XML file. It has more than 30 fields. The XML file is not in a structured format, so I could not use the Databricks API for converting XML into a DataFrame. – senthil kumar p Jan 26 '17 at 05:03
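
A minimal sketch of the positional idea from the comments: zipWithIndex pairs each value with its position, and a pivot spreads the positions into columns. The session setup and column labels here are illustrative, and this approach also sidesteps the 22-field limit that Scala tuples would impose on 30+ values:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

val spark = SparkSession.builder.master("local[*]").appName("ListToWideRow").getOrCreate()
import spark.implicits._

val values = List("a", "b", "v", "b") // imagine 30+ tag values here

// zipWithIndex pairs each element with its position, as the comment suggests
val long = values.zipWithIndex
  .map { case (v, i) => (s"_${i + 1}", v) }
  .toDF("position", "value")

// Pivot the position labels into columns to get the expected one-row layout
long.groupBy().pivot("position").agg(first("value")).show()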

3 Answers


In order to use toDF, we have to import

import spark.sqlContext.implicits._

Please refer to the code below:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")
  .appName("Simple Application")
  .getOrCreate()

import spark.sqlContext.implicits._

val lstData = List(List("vks", 30), List("harry", 30))
val mapLst = lstData.map { case List(a: String, b: Int) => (a, b) }
val lstToDf = spark.sparkContext.parallelize(mapLst).toDF("name", "age")
lstToDf.show

val llist = Seq(("bob", "2015-01-13", 4), ("alice", "2015-04-23", 10)).toDF("name", "date", "duration")
llist.show
  • I'm facing an issue using the toDF: I found that the result of calling parallelize is RDD[T], which doesn't have the toDF method. – aName Jul 25 '19 at 09:00
  • You are probably forgetting this: ```import spark.sqlContext.implicits._``` – Carlos Aug 01 '19 at 08:24
  • can I directly use mapLst.toDF("name","age") in spark 2.x instead of converting list to RDD using parallelize and then doing toDF? I believe toDF will directly take care of parallelizing the dataset. – Dwarrior Aug 22 '19 at 21:55
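
On the last comment: in Spark 2.x a local Seq or List of tuples can indeed call toDF directly, without going through parallelize, as long as the session implicits are in scope. A quick sketch (session setup is illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("LocalSeqToDF").getOrCreate()
import spark.implicits._ // provides toDF on local Seqs via localSeqToDatasetHolder

// No explicit parallelize needed; Spark distributes the local collection itself
val df = List(("vks", 30), ("harry", 30)).toDF("name", "age")
df.show()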

List("a","b","c","d") represents a record with one field and so the resultset displays one element in each row.

To get the expected output, the row should have four fields/elements in it. So, we wrap around the list as List(("a","b","c","d")) which represents one row, with four fields. In a similar fashion a list with two rows goes as List(("a1","b1","c1","d1"),("a2","b2","c2","d2"))

scala> val list = sc.parallelize(List(("a", "b", "c", "d"))).toDF()
list: org.apache.spark.sql.DataFrame = [_1: string, _2: string, _3: string, _4: string]

scala> list.show
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+


scala> val list = sc.parallelize(List(("a1","b1","c1","d1"),("a2","b2","c2","d2"))).toDF
list: org.apache.spark.sql.DataFrame = [_1: string, _2: string, _3: string, _4: string]

scala> list.show
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
| a1| b1| c1| d1|
| a2| b2| c2| d2|
+---+---+---+---+
  • How do I declare this: List(("a1","b1","c1","d1"),("a2","b2","c2","d2"))? Is it a List[List[String]]? I need to build it dynamically from a loop over a table query output. – user1896796 Aug 21 '20 at 05:14
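
On the comment above: List(("a1","b1","c1","d1"),("a2","b2","c2","d2")) is a List of 4-tuples, not a List[List[String]], and tuple arity is fixed at compile time, so it cannot be built dynamically. For rows collected in a loop, a sketch using Row with an explicit schema (the column count and names here are illustrative):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder.master("local[*]").appName("DynamicRows").getOrCreate()

// Each inner Seq is one row, e.g. accumulated from a loop over a query result
val rawRows = Seq(
  Seq("a1", "b1", "c1", "d1"),
  Seq("a2", "b2", "c2", "d2")
)

// Build the schema from the row width instead of a fixed-arity tuple
val schema = StructType((1 to 4).map(i => StructField(s"_$i", StringType)))
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(rawRows.map(r => Row(r: _*))),
  schema
)
df.show()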

This will do:

import org.apache.spark.sql.functions._
import spark.implicits._ // for $ and toDF

val data = List(("Value1", "Cvalue1", 123, 2254, 22), ("Value1", "Cvalue2", 124, 2255, 23))
val df = spark.sparkContext.parallelize(data).toDF("Col1", "Col2", "Expend1", "Expend2", "Expend3")
val cols = Array("Expend1", "Expend2", "Expend3")
val df1 = df
  .withColumn("keys", typedLit(cols)) // lit cannot build an array literal here; typedLit can
  .withColumn("values", array($"Expend1", $"Expend2", $"Expend3"))
  .select($"Col1", $"Col2", explode_outer(map_from_arrays($"keys", $"values")))
df1.show(false)
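
For what the final select produces: map_from_arrays zips the keys and values arrays into a map column, and explode_outer then emits one row per map entry, with columns named key and value, so each input row fans out into three rows alongside Col1 and Col2. Note that map_from_arrays requires Spark 2.4+.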