-1

I have a large data set that I want to import into Databricks to do some analytics using Scala. The data set is available at this link: https://drive.google.com/open?id=1g4YYALk3nArN8bX2uFS70IpbdSf_Efqj

I want to import this data set so that the document ID is in the first column and the remaining text data is in the second column.

But when I import the data using the following code, every line ends up in a single column:

val df = spark.read.text("FileStore/tables/plot_summaries.txt")

df.select("value").show()


Can anyone help me import this properly? Any help would be highly appreciated. Thank you.

student_R123
  • Is that a tab between the ID and the rest of the text? – ernest_k Mar 04 '20 at 06:16
  • Does this answer your question? [Reading TSV into Spark Dataframe with Scala API](https://stackoverflow.com/questions/33898040/reading-tsv-into-spark-dataframe-with-scala-api) – Shaido Mar 04 '20 at 08:55

2 Answers

4

The text reader returns each line as a single value column, so use the CSV reader with a tab separator instead:

spark.read.option("sep", "\t").csv("FileStore/tables/plot_summaries.txt")
Vijay
3

Your data is tab-separated, so you need to specify the delimiter explicitly.

scala> import org.apache.spark.sql.types._
scala> val schema = new StructType().add("DocumentID", LongType, true).add("Description", StringType, true)

scala> val df = spark.read.format("csv").option("delimiter", "\t").schema(schema).load("/plot_summaries.txt")

scala> df.show(10)
+----------+--------------------+
|DocumentID|         Description|
+----------+--------------------+
|  23890098|Shlykov, a hard-w...|
|  31186339|The nation of Pan...|
|  20663735|Poovalli Induchoo...|
|   2231378|The Lemon Drop Ki...|
|    595909|Seventh-day Adven...|
|   5272176|The president is ...|
|   1952976|{{plot}} The film...|
|  24225279|The story begins ...|
|   2462689|Infuriated at bei...|
|  20532852|A line of people ...|
+----------+--------------------+
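
If you would rather keep the spark.read.text call from the question, a rough alternative is to split the value column on the tab yourself. This is only a sketch: it assumes each line holds exactly one tab between the ID and the text (any text after a second tab would be dropped), and the path is the one from the question, so adjust it to your setup.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split}

// On Databricks a session already exists; getOrCreate() simply reuses it.
val spark = SparkSession.builder().appName("plot-summaries").getOrCreate()

// Path taken from the question (assumption: the file was uploaded there).
val raw = spark.read.text("FileStore/tables/plot_summaries.txt")

// Split every line on tabs: element 0 is the ID, element 1 is the text.
val parts = split(col("value"), "\t")

val df = raw.select(
  parts.getItem(0).cast("long").as("DocumentID"), // null if the ID is not numeric
  parts.getItem(1).as("Description")
)

df.show(10)
```

The CSV reader with an explicit schema, as above, is usually the simpler choice; splitting by hand mainly helps when you need the raw lines for something else as well.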
Nikhil Suthar
  • Can you help and suggest how to handle this? https://stackoverflow.com/questions/62036791/while-writing-to-hdfs-path-getting-error-java-io-ioexception-failed-to-rename – BdEngineer May 27 '20 at 06:49