I have a file that contains many HDFS paths, and each HDFS path contains some JSON files. I want to process only the JSON files that were updated in the last 24 hours. At the moment I read the file containing the paths and process each path sequentially, but this takes a lot of time.
After processing, I also need to append the results to a Hive table, so HiveContext may also come into the picture when processing in parallel. The question is: how can I process these paths in parallel?
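For reference, here is a simplified sketch of my current sequential approach (I am on Spark 1.x with `sc` and a `hiveContext` already created; the file locations, the table name, and the exact processing are placeholders):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val cutoff = System.currentTimeMillis() - 24L * 60 * 60 * 1000  // 24 hours ago

// Read the file that lists the HDFS directories, one path per line
val hdfsPaths = scala.io.Source.fromFile("/tmp/paths.txt").getLines().toList

for (dir <- hdfsPaths) {
  // Keep only the JSON files modified in the last 24 hours
  val recentJsonFiles = fs.listStatus(new Path(dir))
    .filter(s => s.getPath.getName.endsWith(".json") && s.getModificationTime >= cutoff)
    .map(_.getPath.toString)

  for (file <- recentJsonFiles) {
    val df = hiveContext.read.json(file)                  // process the JSON
    df.write.mode("append").saveAsTable("my_hive_table")  // add it to the Hive table
  }
}
```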
I tried:
- Reading the file that contains the paths with sc.textFile and processing each path inside a foreach. That failed with a "FileSystem object is not serializable" error. So instead I built the list of all recently updated files sequentially on the driver and tried to process that list in parallel (via sc.parallelize and foreach), but then I got a NullPointerException. A sketch of both attempts follows this list.
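Roughly what the two failed attempts looked like (again simplified; the paths and table name are placeholders, and the comment on the NullPointerException is my guess at the cause):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val cutoff = System.currentTimeMillis() - 24L * 60 * 60 * 1000

// Attempt 1: a driver-side FileSystem used inside foreach on the executors.
// This fails because the Hadoop FileSystem object captured in the closure
// is not serializable.
sc.textFile("hdfs:///tmp/paths.txt").foreach { dir =>
  fs.listStatus(new Path(dir))                       // fs comes from the driver
    .filter(_.getModificationTime >= cutoff)
    .foreach(status => println(status.getPath))      // placeholder processing
}

// Attempt 2: build the list of recently updated files sequentially on the
// driver, then parallelize it. The foreach throws a NullPointerException,
// I assume because hiveContext only lives on the driver and is null inside
// the executor closure.
val recentFiles: Seq[String] =
  scala.io.Source.fromFile("/tmp/paths.txt").getLines().toSeq.flatMap { dir =>
    fs.listStatus(new Path(dir))
      .filter(s => s.getPath.getName.endsWith(".json") && s.getModificationTime >= cutoff)
      .map(_.getPath.toString)
  }

sc.parallelize(recentFiles).foreach { file =>
  val df = hiveContext.read.json(file)               // NPE here
  df.write.mode("append").saveAsTable("my_hive_table")
}
```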
Now I am not sure how to proceed.