Import data using Spark Scala

Question

I have a large data set that I want to import into Databricks to do some analytics using Scala. The data set is available at this link: https://drive.google.com/open?id=1g4YYALk3nArN8bX2uFS70IpbdSf_Efqj

I want to import this data set so that the document ID is in the first column and the remaining text data is in a second column.

But when I import the data using the following code, each whole line ends up in a single "value" column:

val df = spark.read.text("FileStore/tables/plot_summaries.txt")

df.select("value").show()

Can anyone help me import this properly? Any help would be highly appreciated. Thank you.

Does this answer your question? [Reading TSV into Spark Dataframe with Scala API](https://stackoverflow.com/questions/33898040/reading-tsv-into-spark-dataframe-with-scala-api) — Shaido, Mar 04 '20 at 08:55

score 4 · Answer 1 · answered Mar 04 '20 at 06:34

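Read the file with the CSV reader and a tab separator; the text reader does not split lines into columns. This should solve your issue.

spark.read.option("sep", "\t").csv("FileStore/tables/plot_summaries.txt")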
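When no schema is supplied, a reader that splits on the delimiter names the columns _c0 and _c1, so a rename is usually wanted as well. Below is a minimal sketch of that, not a definitive implementation. It assumes a local SparkSession and uses a small hypothetical two-line sample file written to a temp location in place of plot_summaries.txt; the column names DocumentID and Description are only illustrative.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration only. In a Databricks notebook
// or spark-shell, the existing `spark` session can be used instead.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("tsv-import-sketch")
  .getOrCreate()

// Hypothetical sample standing in for plot_summaries.txt:
// one "<id><TAB><text>" record per line.
val samplePath = Files.createTempFile("plot_summaries_sample", ".txt")
Files.write(
  samplePath,
  "1\tfirst summary\n2\tsecond summary\n".getBytes(StandardCharsets.UTF_8)
)

// Split each line on the tab, then replace the default _c0/_c1 names.
val df = spark.read
  .option("sep", "\t")
  .csv(samplePath.toString)
  .toDF("DocumentID", "Description")

df.show(false)
```

Without a schema both columns come back as strings. On Databricks, the path would point at the uploaded file rather than a local temp file.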

Vijay

score 3 · Accepted Answer · answered Mar 04 '20 at 06:48

Your data is tab-separated, so you need to specify the delimiter explicitly when reading it.

scala> import org.apache.spark.sql.types._
scala> val schema = new StructType().add("DocumentID", LongType, true).add("Description", StringType, true)

scala> val df = spark.read.format("csv").option("delimiter", "\t").schema(schema).load("/plot_summaries.txt")

scala> df.show(10)
+----------+--------------------+
|DocumentID|         Description|
+----------+--------------------+
|  23890098|Shlykov, a hard-w...|
|  31186339|The nation of Pan...|
|  20663735|Poovalli Induchoo...|
|   2231378|The Lemon Drop Ki...|
|    595909|Seventh-day Adven...|
|   5272176|The president is ...|
|   1952976|{{plot}} The film...|
|  24225279|The story begins ...|
|   2462689|Infuriated at bei...|
|  20532852|A line of people ...|
+----------+--------------------+
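If the file has already been loaded with spark.read.text as in the question, the single "value" column can also be split after the fact. The following is a rough sketch under stated assumptions: the in-memory rows are hypothetical stand-ins for the real file, a local SparkSession is created for illustration, and each line is assumed to contain exactly one tab (anything after a second tab would be dropped by this approach).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split}

// Assumption: a local session for illustration only. In a Databricks notebook
// or spark-shell, the existing `spark` session can be used instead.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("split-value-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the result of spark.read.text(...):
// a DataFrame with one string column named "value".
val raw = Seq("1\tfirst summary", "2\tsecond summary").toDF("value")

// Split each line on the tab character and pick the two pieces.
val parts = split(col("value"), "\t")
val df = raw.select(
  parts.getItem(0).cast("long").as("DocumentID"),
  parts.getItem(1).as("Description")
)

df.printSchema()
df.show(false)
```

Reading with the delimiter and a schema, as shown above, is usually simpler; the split is mainly useful when the raw lines are needed for something else too.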

Can you help and suggest how to handle this? https://stackoverflow.com/questions/62036791/while-writing-to-hdfs-path-getting-error-java-io-ioexception-failed-to-rename — BdEngineer, May 27 '20 at 06:49
