I am using PySpark (Spark 1.4.1) to process some data. The raw data file looks like this:
2015-06-07 14:44:56.09
username='Maria'
age=22
2015-06-07 14:44:56.10
username='tom'
age=38
When I read in the text file with
text_rdd = sc.textFile('somefile.txt')
and look at the records, each line is treated as an individual record. Is it possible to somehow read multi-line input as a single record?
Based on this thread from the Spark user list, http://apache-spark-user-list.1001560.n3.nabble.com/example-of-non-line-oriented-input-data-td2750.html , you have to glom the partitions and join each partition's lines into one string. That thread is from 2014, however, and I was wondering if anyone has come up with a better solution since then.
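In case it helps, here is a minimal sketch of how I understand that glom approach (the timestamp regex is my assumption about the record boundary, as is the coalesce(1) to keep a record from straddling a partition boundary):

import re

text_rdd = sc.textFile('somefile.txt')

def split_records(lines):
    # Re-join the partition's lines into one string, then split just
    # before each timestamp line, which marks the start of a record.
    blob = '\n'.join(lines)
    parts = re.split(r'\n(?=\d{4}-\d{2}-\d{2} )', blob)
    return [p for p in parts if p.strip()]

# coalesce(1) avoids a record being cut in half at a partition
# boundary, but it also gives up all parallelism during the split.
records_rdd = text_rdd.coalesce(1).glom().flatMap(split_records)

Each element of records_rdd would then be one multi-line record, e.g. "2015-06-07 14:44:56.09\nusername='Maria'\nage=22".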
Many thanks!