I have the files collection specified with comma separator, like:
hdfs://user/cloudera/date=2018-01-15,hdfs://user/cloudera/date=2018-01-16,hdfs://user/cloudera/date=2018-01-17,hdfs://user/cloudera/date=2018-01-18,hdfs://user/cloudera/date=2018-01-19,hdfs://user/cloudera/date=2018-01-20,hdfs://user/cloudera/date=2018-01-21,hdfs://user/cloudera/date=2018-01-22
and I'm loading the files with Apache Spark, all in once with:
val input = sc.textFile(files)
Also, I have additional information associated with each file - the unique ID, for example:
File ID
--------------------------------------------------
hdfs://user/cloudera/date=2018-01-15 | 12345
hdfs://user/cloudera/date=2018-01-16 | 09245
hdfs://user/cloudera/date=2018-01-17 | 345hqw4
and so on
As the output, I need to receive the DataFrame with the rows, where each row will contain the same ID, as the ID of the file from which this line was read.
Is it possible to pass this information in some way to Spark in order to be able to associate with the lines?