
I am working with a large data set laid out as key:value pairs in the following form: each line contains one key:value pair, and an empty line delimits the end of a record.

cat_1/key_1: a value
cat_1/key_2: a value
cat_2/key_3: a value

cat_1/key_1: another value
cat_2/key_3: another value

My goal is to transform this text file into a data frame whose records can easily be persisted to a table.

In another programming paradigm, I might iterate over the file and write records to another data structure as newlines are encountered. However, I am looking for a more idiomatic way to accomplish this in Spark.

I am stuck on the best approach in Spark for treating the \n as the record delimiter after creating a new RDD in which each line is mapped to line.split(": ").
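
For reference, a minimal sketch of where I am so far (the file path is a placeholder); it yields one key/value array per line but does not yet group lines into records:

// Split each "key: value" line into a two-element array
val pairs = sc.textFile("data.txt").map(_.split(": "))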


1 Answer


Spark creates a new element per line, so I'm not sure what the issue with the newline is, but you could do something like mapping the data to a case class. The case class defines the schema of the table. Pretty straightforward. The following is essentially a rewrite of the documentation.

case class Data(key: String, value: String)

// Set up an SQLContext and import its implicits so toDF() is available
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

// Read in data from file (note the quoted file URI)
val data = sc.textFile("file:///C:/location/of/my/data.txt")

// Skip the blank record-delimiter lines, then map each
// "key: value" line to the case class and create the RDD
val myData = data
  .filter(_.trim.nonEmpty)
  .map(_.split(": ", 2))
  .map(p => Data(p(0), p(1)))

// To dataframe
val myDataAsDataFrame = myData.toDF()

// Register the table
myDataAsDataFrame.registerTempTable("tableName")
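
Once the temp table is registered you can query it with SQL; a quick sanity check (using the table name from above) might look like:

// Query the registered temp table and print a few rows
val results = sqlContext.sql("SELECT key, value FROM tableName")
results.show()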