I need to read a lot of gzip files from HDFS, like this: sc.textFile('*.gz'). Some of these gzips are corrupted, and reading them raises
java.io.IOException: gzip stream CRC failure
which stops the whole job.
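
For reference, this is roughly what the job looks like (the HDFS path is just a placeholder):

    from pyspark import SparkContext

    sc = SparkContext(appName="read-gzips")
    # One corrupted .gz anywhere under the glob makes the whole action fail
    # with the CRC error above.
    lines = sc.textFile("hdfs:///data/*.gz")
    print(lines.count())
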
I read the debate here, where someone has the same need but gets no clear solution. Since it's apparently not appropriate to implement this inside Spark itself (according to the link), is there any way to just brutally skip the corrupted files? There seem to be hints for Scala users, but I have no idea how to do it in Python.
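
The closest thing I can imagine in Python is to bypass Hadoop's gzip codec, read the files as binary with binaryFiles and decompress them myself, skipping whatever fails. Something like this (just an untested sketch: it pulls each whole file into executor memory, and the utf-8 decoding is an assumption about my data):

    import gzip
    import io
    import zlib

    def safe_gunzip(path_and_bytes):
        path, raw = path_and_bytes
        try:
            # Decompress the whole file in Python; a corrupted gzip raises
            # an error here instead of killing the whole job.
            with gzip.GzipFile(fileobj=io.BytesIO(raw)) as f:
                return f.read().decode("utf-8").splitlines()
        except (IOError, OSError, EOFError, zlib.error):
            # Corrupted file: skip it and keep going.
            return []

    # binaryFiles yields one (path, file_contents_as_bytes) record per file.
    lines = sc.binaryFiles("hdfs:///data/*.gz").flatMap(safe_gunzip)

Would something along those lines be reasonable, or does it fall apart for big files?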
Or is my only option to detect the corrupted files first and delete them?
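
If scanning first really is the only way, I guess I could stream every file through Python's gzip module before running the job, roughly like this (again just a sketch: the hdfs dfs -cat subprocess and the example path are placeholders, and it means reading all the data twice):

    import gzip
    import subprocess
    import zlib

    def is_valid_gzip(hdfs_path):
        # Stream one HDFS file through Python's gzip; the CRC is only
        # verified once the whole stream has been read.
        proc = subprocess.Popen(["hdfs", "dfs", "-cat", hdfs_path],
                                stdout=subprocess.PIPE)
        try:
            with gzip.GzipFile(fileobj=proc.stdout) as f:
                while f.read(1024 * 1024):
                    pass
            return True
        except (IOError, OSError, EOFError, zlib.error):
            return False
        finally:
            proc.stdout.close()
            proc.wait()

    # e.g. is_valid_gzip("hdfs:///data/part-00042.gz")

But that feels wasteful.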
What if I have a huge number of gzips and, after a day of running, find out that the very last one is corrupted? The whole day is wasted, and corrupted gzips are quite common.