You can use wholeTextFiles to read the files into an RDD. This reads in each file with the filename as the key and the entire content of the file as the value. From there, you should be able to use flatMapValues to separate each record into its own k/v pair.
val input = sc.wholeTextFiles("s3://...")
val inputFlat = input.flatMapValues(content => content.split("\n"))
For this example, if your path was /user/hive/date=December/part-0000 and the contents of part-0000 were:
Joe December-28 Something
Ryan December-29 AnotherThing
The output would look like this:
input.take(1)
(/user/hive/date=December/part-0000, Joe December-28 Something\nRyan December-29 AnotherThing)
inputFlat.take(2)
(/user/hive/date=December/part-0000, Joe December-28 Something)
(/user/hive/date=December/part-0000, Ryan December-29 AnotherThing)
I suppose you could try the following. It would be a bit slow to read the records this way, but after the repartition you can maximize the parallel processing:
inputFlat.flatMapValues(/* some split */).repartition(numWorkers)
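As a rough sketch of that split, assuming space-delimited records like the example above and a Hive-style date=... directory in the path (the field layout, the date parsing, and the numWorkers value are all assumptions for illustration; map stands in for flatMapValues since each line yields exactly one output record):

val numWorkers = 8 // illustrative; set to your cluster's parallelism
val records = inputFlat
  .map { case (path, line) =>
    // e.g. path = /user/hive/date=December/part-0000
    val date = path.split("/").find(_.startsWith("date=")).map(_.stripPrefix("date="))
    val fields = line.split(" ") // e.g. Array(Joe, December-28, Something)
    (date.getOrElse(""), fields(0), fields(1), fields(2))
  }
  .repartition(numWorkers)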
One other potential thing you could try: in Hive, you can retrieve the file each record came from using the virtual column named INPUT__FILE__NAME, for example:
select INPUT__FILE__NAME, id, name from users where ...;
I'm not sure it would work, but you could try using that through the SQL API. You would have to make sure your sqlContext is backed by Hive and can find hive-site.xml on the classpath.
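A minimal sketch of what that might look like, assuming Spark 1.x built with Hive support (the users table and its columns are placeholders carried over from the query above):

import org.apache.spark.sql.hive.HiveContext

// HiveContext picks up hive-site.xml from the classpath to reach the metastore.
val hiveContext = new HiveContext(sc)

// INPUT__FILE__NAME is Hive's virtual column holding each record's source file.
val withPaths = hiveContext.sql("select INPUT__FILE__NAME, id, name from users")
withPaths.take(5).foreach(println)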