I am trying to run a Pig script on bulk Wikipedia page statistics data. To start with, I am just doing a basic filter:
A = LOAD '/data' USING PigStorage(' ') AS (project:chararray, page:chararray, requests:int, size:int);
B = FILTER A BY project == 'en';
DUMP B;
This works fine when I load 2-3 files, but it fails when I load all the files. The error is:
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias B
To rule out corrupted records, I made several copies of a file that was working and ran the above script on those copies, but no luck: I get the same error. Please advise!