Here's my Pig script:

    json = LOAD '/tmp/events/*/*/flume-.*' USING JsonLoader('state:chararray, city:chararray, promotionType:chararray, promotionPlace:chararray, purchase:int');
    grouped = FOREACH (GROUP json BY (state, city, promotionType, promotionPlace)) GENERATE group, SUM(json.purchase) AS purchase;
    grpd = GROUP grouped BY group.city;
    top1 = FOREACH grpd {
        sorted = ORDER grouped BY purchase DESC;
        top = LIMIT sorted 1;
        GENERATE group, FLATTEN(top);
    };
    DUMP top1;

It works for a few files, but with many files (about 3k) it fails with the error: 'unable to open iterator for alias top1'. Any ideas how to solve this?

Nail Shakirov
  • Hard to say. Maybe one of your 3k files is corrupted, or does not have the same schema? You could try to load and dump a union of the data. – AntonyBrd Oct 26 '15 at 13:25
  • For people who found this post when looking for [ERROR 1066: Unable to open iterator for alias](http://stackoverflow.com/questions/34495085/error-1066-unable-to-open-iterator-for-alias-in-pig-generic-solution) here is a [generic solution](http://stackoverflow.com/a/34495086/983722). – Dennis Jaheruddin Dec 28 '15 at 15:34

1 Answer

If you have code that mostly works, except for some files, here is what you probably want to do when 'thinking harder' doesn't solve the problem:

  1. Find a file in which the error occurs and keep only that data.
  2. Try the top half of the data. If the error occurs, keep that half and repeat this step.
  3. Otherwise, try the bottom half (just to be sure). If the error occurs, keep that half and go back to step 2.

Within a few steps you should have only 1 line left that is causing the error, and which should be simple to inspect.
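
The halving procedure above can be sketched in Python. This is only an illustration, not part of the original script: `find_failing_item` and `toy_fails` are hypothetical names, and the predicate here simply checks whether every line parses as JSON. In practice the predicate would write the subset somewhere and run the Pig script on it, which is not shown.

```python
import json
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def find_failing_item(
    items: Sequence[T], fails: Callable[[Sequence[T]], bool]
) -> Optional[T]:
    """Halve `items` until one item that makes `fails` return True is left.

    Returns None if the full input does not fail, or if neither half
    fails on its own (the error then depends on a combination of items).
    """
    if not items or not fails(items):
        return None
    current = list(items)
    while len(current) > 1:
        mid = len(current) // 2
        top, bottom = current[:mid], current[mid:]
        if fails(top):
            current = top        # error is in the top half: keep it
        elif fails(bottom):
            current = bottom     # otherwise check the bottom half
        else:
            return None          # neither half fails alone
    return current[0]


def toy_fails(lines: Sequence[str]) -> bool:
    """Hypothetical stand-in for 'the job fails on this data'."""
    for line in lines:
        try:
            json.loads(line)
        except ValueError:
            return True
    return False


# Made-up sample data: the third line is not valid JSON.
sample_lines = [
    '{"city": "A", "purchase": 1}',
    '{"city": "B", "purchase": 2}',
    '{"city": "C", "purchase": oops}',
    '{"city": "D", "purchase": 4}',
]
culprit = find_failing_item(sample_lines, toy_fails)
print(culprit)
```

The same function can be applied twice: first to the list of file paths to find one bad file, then to the lines of that file. Each call to the predicate costs one job run, so the number of runs grows roughly with the logarithm of the input size.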

Dennis Jaheruddin