I am working with a large data set laid out as key:value pairs of the following form: each line contains one key:value pair, and a newline delimits each record.
cat_1/key_1: a value
cat_1/key_2: a value
cat_2/key_3: a value
cat_1/key_1: another value
cat_2/key_3: another value
My goal is to transform this text file into a data frame whose records can easily be persisted to a table.
In another programming paradigm, I might iterate over the file and write records to another data structure as newlines are encountered. However, I am looking for a more idiomatic way to accomplish this in Spark.
I am stuck on the best approach in Spark for handling the newline (\n) as the record delimiter after creating a new RDD in which each line is mapped with line.split(": ").
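To make the transformation concrete, here is the per-line split I have in mind sketched in plain Python (the helper name and three-column layout are just for illustration); in Spark this same logic would presumably run inside an rdd.map() over the lines produced by textFile:

```python
sample = """cat_1/key_1: a value
cat_1/key_2: a value
cat_2/key_3: a value
cat_1/key_1: another value
cat_2/key_3: another value"""

def parse_line(line):
    # Split only on the first ": " so values that themselves contain ": "
    # are not truncated.
    key, value = line.split(": ", 1)
    # The key itself carries two pieces of information: "category/field".
    category, field = key.split("/", 1)
    return (category, field, value)

# One tuple per line; newlines are the record delimiter.
records = [parse_line(line) for line in sample.splitlines() if line.strip()]
print(records[0])  # ('cat_1', 'key_1', 'a value')
```

Each resulting tuple would become one row of the data frame, with category, field, and value as columns.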