
I am trying to run a Pig script on bulk Wikipedia page statistics data. To start off, I am just doing a basic filter like:

A = LOAD '/data' USING PigStorage(' ') AS (project:chararray, page:chararray, requests:int, size:int);
B = FILTER A BY project == 'en';
DUMP B;

This works fine when I load 2-3 files, but it errors out when I load all the files. The error is:

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias B

To confirm that there are no corrupted records, I made several copies of the file that was known to work and ran the above script on those copies, but no luck.
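
Another way I could check for malformed records directly in Pig would be something like the rough sketch below (the alias names are only for illustration): rows whose requests or size fields fail the int cast in LOAD come through as NULL, so counting those NULLs should flag bad records.

-- Rough corruption check: count rows where the int casts in LOAD failed.
RAW = LOAD '/data' USING PigStorage(' ') AS (project:chararray, page:chararray, requests:int, size:int);
BAD = FILTER RAW BY requests IS NULL OR size IS NULL;
BAD_ALL = GROUP BAD ALL;
BAD_COUNT = FOREACH BAD_ALL GENERATE COUNT_STAR(BAD);
DUMP BAD_COUNT;

Please advise!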

Mukund Rao
  • I think you are doing three operations here: load, filter, and dump. Could you run these operations separately for all files and see which one is actually giving you the error? – vmachan Jan 27 '16 at 23:20
  • @vmachan I tried these commands individually; the dump is the one giving the error. – Mukund Rao Jan 27 '16 at 23:21
  • In that case, check this [SO post](http://stackoverflow.com/questions/20350122/error-1066-unable-to-open-iterator-for-alias-pig), which seems to address similar issues. – vmachan Jan 27 '16 at 23:22
  • Hmm, I already saw that; I'm on Hadoop 2.7 and Pig 0.15. – Mukund Rao Jan 27 '16 at 23:31
  • Without a DUMP or STORE statement the job is not actually executed; only the Pig frontend runs (to check syntax and build the execution plan). This means the job only fails when it is actually executed, which usually indicates data corruption. It could also be that you're running out of memory somewhere... – LiMuBei Jan 28 '16 at 14:36
  • @LiMuBei is correct: the script will actually execute only after the DUMP or STORE statement, so it might be failing at any stage, not necessarily at the dump. I would start by removing all the operations (the filter in your case) and trying LOAD and DUMP first. If that fails, it would mean either 1. you are out of memory or 2. the data is corrupted. In both cases I would start small and keep increasing the input data to figure out where it fails (a sketch of this approach follows these comments). – Gaurav Phapale Jan 29 '16 at 18:45
  • Thanks for the input @LiMuBei & Gaurav. I resized my cluster and I am still facing a similar issue. The dump is the one giving the issue, but it works fine with 2-3 files. Yet again I made 100% sure that the data is not corrupted. – Mukund Rao Feb 01 '16 at 19:50
  • What I fail to understand is that if I run the same job twice on the same files, it succeeds sometimes and fails other times with the "unable to open iterator" error. I have spent a considerable amount of time debugging this :/ – Mukund Rao Feb 01 '16 at 20:23
  • Config details: namenode with 32 GB RAM (m4.2xlarge in AWS); 3 datanodes, each with 16 GB RAM; HDFS: 240 GB, with 60 GB on each datanode. – Mukund Rao Feb 01 '16 at 20:23
  • Do you get any errors in the mapper/reducer logs? – LiMuBei Feb 02 '16 at 08:48
  • @LiMuBei This is weird: on the attempts that fail, I see under userlogs/application/container an error saying it is unable to connect to the namenode on some port. The port differs every time for every application, and there are no processes running on that port on the namenode. – Mukund Rao Feb 03 '16 at 20:31
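
Following the suggestions in the comments above, a minimal debugging run might look like the sketch below. The glob pattern and file names are hypothetical; the idea is to start with a bare LOAD and DUMP on a small subset and widen it step by step.

-- Bare LOAD + DUMP on a small, hypothetical subset of the pagecount files; no FILTER.
SMALL = LOAD '/data/pagecounts-20160101-*' USING PigStorage(' ') AS (project:chararray, page:chararray, requests:int, size:int);
DUMP SMALL;
-- If this succeeds, widen the glob (one day, one week, the full set) until the
-- failure reappears; the first failing batch narrows down whether the problem
-- is bad records or memory pressure.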

0 Answers