When reading files with Apache Spark (I'm using PySpark), you would normally expect a key in each row, for instance like this:
key1, timestamp1, value1
key2, timestamp2, value2
key1, timestamp3, value3
key1, timestamp4, value4
which is then reduced by keys to
key1 {{timestamp1, value1}, {timestamp3, value3}, {timestamp4, value4}}
key2 {{timestamp2, value2}}
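In PySpark, that reduction would look roughly like the following sketch (the path and app name are placeholders I made up):

from pyspark import SparkContext

sc = SparkContext(appName="keyed-format")  # placeholder app name

def parse_line(line):
    # "key1, timestamp1, value1" -> ("key1", ("timestamp1", "value1"))
    key, timestamp, value = [part.strip() for part in line.split(",")]
    return (key, (timestamp, value))

rows = sc.textFile("hdfs:///path/to/keyed_input.txt")  # placeholder path
grouped = rows.map(parse_line).groupByKey()
# e.g. ("key1", [("timestamp1", "value1"), ("timestamp3", "value3"), ("timestamp4", "value4")])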
This is best practice because you never know at which line the file will be split when it is read from HDFS, and having the key in every row makes map-reduce straightforward. But my input file looks different:
key1
timestamp1, value1
timestamp3, value3
-------- split --------
timestamp4, value4
key2
timestamp2, value2
...
The problem is that HDFS may split the file at an arbitrary position, so when the second node of the Hadoop/Spark cluster reads the second part of the file, it starts with {timestamp4, value4} without knowing that it belongs to key1.
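To make that concrete, here is the kind of stateful per-partition parse I would naively write, assuming key lines can be recognized because they contain no comma. It fails precisely because the records at the start of such a partition arrive before any key line:

def parse_partition(lines):
    current_key = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if "," not in line:          # assumption: key lines contain no comma
            current_key = line
        else:
            timestamp, value = [p.strip() for p in line.split(",")]
            # broken: if this partition starts in the middle of a key's block,
            # current_key is still None here
            yield (current_key, (timestamp, value))

records = sc.textFile("hdfs:///path/to/blocked_input.txt").mapPartitions(parse_partition)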
Is there a way to solve this problem? I would like to avoid having to transform the input files into another format on a local machine before loading them into the cluster.
Maybe using a custom file-splitter? I'm looking for a solution in Python 2.7 (PySpark).
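To illustrate what I mean by a custom splitter: something like the sketch below is what I have in mind, i.e. telling Hadoop's TextInputFormat to break records at key lines instead of at newlines, so that Hadoop itself handles records spanning HDFS block boundaries. It assumes (possibly wrongly for my real data) that every key line starts with the literal prefix "key", and I don't know whether it is a robust approach:

blocks = sc.newAPIHadoopFile(
    "hdfs:///path/to/blocked_input.txt",                        # placeholder path
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf={"textinputformat.record.delimiter": "\nkey"})

def parse_block(block):
    # block should be one whole "keyN / timestamp, value / ..." section
    block = block.strip()
    if not block:
        return []
    if not block.startswith("key"):
        block = "key" + block        # the delimiter "\nkey" is consumed, so restore the prefix
    lines = [l.strip() for l in block.splitlines() if l.strip()]
    key, data_lines = lines[0], lines[1:]
    return [(key, tuple(p.strip() for p in line.split(","))) for line in data_lines]

records = blocks.values().flatMap(parse_block)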
Thanks for any hints!