I am using PySpark (Spark 1.4.1) to process some data. The raw data file looks like this:
2015-06-07 14:44:56.09
username='Maria'
age=22
2015-06-07 14:44:56.10
username='tom'
age=38
When I read in the text file with
text_rdd = sc.textFile('somefile.txt')
and look at the records, each line is treated as an individual record. Is it possible to somehow read multi-line input as a single record?
Based on this thread from the Spark user list, http://apache-spark-user-list.1001560.n3.nabble.com/example-of-non-line-oriented-input-data-td2750.html , you have to glom the partitions and join each partition's lines into one string. That thread is from 2014, however, and I was wondering if anyone has come up with a better solution since then.
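In case it helps, here is a minimal sketch of how I understand that glom approach (the timestamp regex is my assumption about the record boundary, as is the coalesce(1) to keep a record from straddling a partition boundary):

import re

text_rdd = sc.textFile('somefile.txt')

def split_records(lines):
    # Re-join the partition's lines into one string, then split just
    # before each timestamp line, which marks the start of a record.
    blob = '\n'.join(lines)
    parts = re.split(r'\n(?=\d{4}-\d{2}-\d{2} )', blob)
    return [p for p in parts if p.strip()]

# coalesce(1) avoids a record being cut in half at a partition
# boundary, but it also gives up all parallelism during the split.
records_rdd = text_rdd.coalesce(1).glom().flatMap(split_records)

Each element of records_rdd would then be one multi-line record, e.g. "2015-06-07 14:44:56.09\nusername='Maria'\nage=22".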
Many thanks!