
If you are reading files with Apache Spark (I'm using PySpark), you would usually expect a key in each row, for instance like this:

key1, timestamp1, value1
key2, timestamp2, value2
key1, timestamp3, value3
key1, timestamp4, value4

which is then reduced by keys to

key1 {{timestamp1, value1}, {timestamp3, value3}, {timestamp4, value4}}
key2 {{timestamp2, value2}}
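
With that layout, grouping is straightforward; a minimal PySpark sketch of what I mean (the path and the comma parsing are just placeholders):

```python
from pyspark import SparkContext

sc = SparkContext(appName="group-by-key")  # placeholder app name

# Each line carries its key, so every line can be parsed on its own,
# no matter where HDFS splits the file.
lines = sc.textFile("hdfs:///data/observations.csv")  # placeholder path

def parse(line):
    key, timestamp, value = [f.strip() for f in line.split(",")]
    return (key, (timestamp, value))

grouped = lines.map(parse).groupByKey().mapValues(list)
# -> key1: [(timestamp1, value1), (timestamp3, value3), (timestamp4, value4)], ...
```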

This is best practice because you never know at which line the file is split when reading from HDFS, and having the key in each row makes map-reduce straightforward. But my input file looks different:

key1
timestamp1, value1
timestamp3, value3
-------- split --------
timestamp4, value4
key2
timestamp2, value2
...

The problem is that HDFS might split the file at an arbitrary location, so when the second node of the Spark cluster reads the second part of the file, it starts with {timestamp4, value4} without knowing that it belongs to key1.

Is there a way to solve that problem? I would like to avoid transforming the input files into another format on a local machine before loading them into the cluster.

Maybe using a custom file splitter? I'm looking for a solution in Python 2.7 (PySpark).
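
One workaround I have considered is to read each file unsplit with wholeTextFiles and re-attach the keys while parsing. Here is a minimal sketch of that idea (the path, the app name and the key-detection rule are just placeholder assumptions about my format, and every file has to fit into an executor's memory):

```python
from pyspark import SparkContext

sc = SparkContext(appName="keyed-multiline")  # placeholder app name

# wholeTextFiles yields (filename, whole file content), so HDFS never splits
# inside a file -- but each file must fit into memory on a single executor.
files = sc.wholeTextFiles("hdfs:///data/input/*")  # placeholder path

def parse(content):
    """Walk the lines, remember the current key, emit (key, (timestamp, value))."""
    records = []
    current_key = None
    for line in content.splitlines():
        line = line.strip()
        if not line:
            continue
        if "," not in line:      # assumption: a line without a comma is a key line
            current_key = line
        elif current_key is not None:
            timestamp, value = [f.strip() for f in line.split(",", 1)]
            records.append((current_key, (timestamp, value)))
    return records

pairs = files.flatMap(lambda name_content: parse(name_content[1]))
grouped = pairs.groupByKey().mapValues(list)
```

That avoids the split problem, but it gives up parallel reading of a single large file, which is why I'm hoping for a splitter-level solution.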

Thanks for any hint!

Matthias
  • Can you just map each key with the timestamp-value pair? – Matt Cremeens Jun 29 '16 at 18:27
  • each key has a list (vector) of several timestamp-value-pairs. For instance a signal that has been measured at several times (observations). Usually you would add the key to each row (observation). But in this case my input file comes from someone else and is missing the key in each row. – Matthias Jun 29 '16 at 18:32
  • You'll have to be more precise than this. In some cases you can leverage records structure as [shown here](http://stackoverflow.com/q/31227363/1560062) otherwise you'll have to create custom Hadoop input format or try to stitch records later. – zero323 Jun 29 '16 at 18:32
  • Wow, I think that's exactly what I've been looking for. Will try that and let you know. Many thanks! Your example addresses the same issue: an event ID might occur at any line and the following lines belong to that event, so it needs to make sure that the chunk processed by Spark starts at the event ID. Perfect! – Matthias Jun 29 '16 at 18:39
  • Hmm, no solution yet. I tried following your example, but running into problems. Maybe you can have a look at my [new question](http://stackoverflow.com/questions/38117391/pyspark-reading-from-multiline-record-textfile-with-newapihadoopfile). – Matthias Jun 30 '16 at 09:09

0 Answers