
I am using PySpark (Spark 1.4.1) to process some data. The raw data file looks like this

2015-06-07 14:44:56.09
username='Maria'
age=22

2015-06-07 14:44:56.10
username='tom'
age=38

When I read in the text file

text_rdd = sc.textFile('somefile.txt')

and look at the records, each line is treated as an individual record. Is it possible to somehow read multiline inputs into one record?

Based on this thread, http://apache-spark-user-list.1001560.n3.nabble.com/example-of-non-line-oriented-input-data-td2750.html , you have to glom the partitions and join the records into one string. That link is from 2014, though, so I was wondering whether anyone has a better solution for this situation.
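For reference, the per-partition grouping the linked thread describes can be sketched in plain Python (hypothetical helper names; in Spark this logic would run inside mapPartitions after glom-ing each partition into a list of lines):

```python
def group_records(lines):
    """Group consecutive non-blank lines into one record each,
    splitting on blank lines (sketch of the approach from the
    linked mailing-list thread)."""
    record = []
    for line in lines:
        if line.strip():
            record.append(line.strip())
        elif record:
            # blank line ends the current record
            yield record
            record = []
    if record:
        yield record

# Sample mirroring the raw data file above
sample = [
    "2015-06-07 14:44:56.09",
    "username='Maria'",
    "age=22",
    "",
    "2015-06-07 14:44:56.10",
    "username='tom'",
    "age=38",
]

records = list(group_records(sample))
# yields two records, each a list of three lines
```

In Spark terms this would look something like `text_rdd.glom().flatMap(group_records)`, though that only groups within partitions and assumes a record never straddles a partition boundary, which is exactly the fragility I was hoping a newer approach avoids.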

Many thanks!

zero323
