I have a file that contains many HDFS paths, and each HDFS path contains some JSON files. I want to process only the JSON files that were updated in the last 24 hours. At the moment I read the file containing the paths and process each path sequentially, but this takes a lot of time.
After processing, I also need to append the results to a Hive table, so HiveContext may also come into the picture when processing in parallel. The question is: how can I process these paths in parallel?
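For reference, here is a simplified sketch of my current sequential approach (I am on Spark 1.x with `sc` and a `hiveContext` already created; the file locations, the table name, and the exact processing are placeholders):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val cutoff = System.currentTimeMillis() - 24L * 60 * 60 * 1000  // 24 hours ago

// Read the file that lists the HDFS directories, one path per line
val hdfsPaths = scala.io.Source.fromFile("/tmp/paths.txt").getLines().toList

for (dir <- hdfsPaths) {
  // Keep only the JSON files modified in the last 24 hours
  val recentJsonFiles = fs.listStatus(new Path(dir))
    .filter(s => s.getPath.getName.endsWith(".json") && s.getModificationTime >= cutoff)
    .map(_.getPath.toString)

  for (file <- recentJsonFiles) {
    val df = hiveContext.read.json(file)                  // process the JSON
    df.write.mode("append").saveAsTable("my_hive_table")  // add it to the Hive table
  }
}
```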
I tried:
- Reading the file that contains the paths with sc.textFile and processing each path inside a foreach. That failed with a "FileSystem object is not serializable" error. So instead I built the list of all recently updated files sequentially on the driver and tried to process that list in parallel (via sc.parallelize and foreach), but then I got a NullPointerException. A sketch of both attempts follows this list.
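Roughly what the two failed attempts looked like (again simplified; the paths and table name are placeholders, and the comment on the NullPointerException is my guess at the cause):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val cutoff = System.currentTimeMillis() - 24L * 60 * 60 * 1000

// Attempt 1: a driver-side FileSystem used inside foreach on the executors.
// This fails because the Hadoop FileSystem object captured in the closure
// is not serializable.
sc.textFile("hdfs:///tmp/paths.txt").foreach { dir =>
  fs.listStatus(new Path(dir))                       // fs comes from the driver
    .filter(_.getModificationTime >= cutoff)
    .foreach(status => println(status.getPath))      // placeholder processing
}

// Attempt 2: build the list of recently updated files sequentially on the
// driver, then parallelize it. The foreach throws a NullPointerException,
// I assume because hiveContext only lives on the driver and is null inside
// the executor closure.
val recentFiles: Seq[String] =
  scala.io.Source.fromFile("/tmp/paths.txt").getLines().toSeq.flatMap { dir =>
    fs.listStatus(new Path(dir))
      .filter(s => s.getPath.getName.endsWith(".json") && s.getModificationTime >= cutoff)
      .map(_.getPath.toString)
  }

sc.parallelize(recentFiles).foreach { file =>
  val df = hiveContext.read.json(file)               // NPE here
  df.write.mode("append").saveAsTable("my_hive_table")
}
```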
Now I am not sure how to proceed.