
I am working with a large data set laid out as key:value pairs in the following form: each line contains one key:value pair, and an empty line delimits the end of a record.

cat_1/key_1: a value
cat_1/key_2: a value
cat_2/key_3: a value

cat_1/key_1: another value
cat_2/key_3: another value

My goal is to transform this text file into a data frame whose records can easily be persisted to a table.

In another programming paradigm, I might iterate over the file and write records to another data structure as newlines are encountered. However, I am looking for a more idiomatic way to accomplish this in Spark.

I am stuck on the best approach in Spark for treating the \n as the record delimiter after creating a new RDD in which each line is mapped to line.split(": ").
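
For reference, a minimal sketch of where I am so far (the file path is a placeholder); it yields one key/value array per line but does not yet group lines into records:

// Split each "key: value" line into a two-element array
val pairs = sc.textFile("data.txt").map(_.split(": "))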


1 Answer


Spark creates a new element per line, so I'm not sure what the issue with the newline is, but you could do something like mapping the data to a case class. The case class defines the schema of the table. Pretty straightforward. The following is essentially a rewrite of the documentation.

case class Data(key: String, value: String)

// Set up an SQLContext and import its implicits so toDF() is available
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

// Read in data from file (note the quoted file URI)
val data = sc.textFile("file:///C:/location/of/my/data.txt")

// Skip the blank record-delimiter lines, then map each
// "key: value" line to the case class and create the RDD
val myData = data
  .filter(_.trim.nonEmpty)
  .map(_.split(": ", 2))
  .map(p => Data(p(0), p(1)))

// To dataframe
val myDataAsDataFrame = myData.toDF()

// Register the table
myDataAsDataFrame.registerTempTable("tableName")
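
Once the temp table is registered you can query it with SQL; a quick sanity check (using the table name from above) might look like:

// Query the registered temp table and print a few rows
val results = sqlContext.sql("SELECT key, value FROM tableName")
results.show()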