
I am analyzing Cluster user log files with the following code in pig:

    t_data = load 'log_flies/*' using PigStorage(',');
    A = foreach t_data generate $0 as (jobid:int), $1 as (indexid:int), $2 as (clusterid:int),
        $6 as (user:chararray), $7 as (stat:chararray), $13 as (queue:chararray),
        $32 as (projectName:chararray), $52 as (cpu_used:float), $55 as (efficiency:float),
        $59 as (numThreads:int), $61 as (numNodes:int), $62 as (numCPU:int),
        $72 as (comTime:int), $73 as (penTime:int), $75 as (runTime:int),
        $52/($62*$75) as (allEff: float), SUBSTRING($68, 0, 11) as (endTime: chararray);
    -- describe A;
    A = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;
    B = group A by user;
    f_data = foreach B {
        grp = group;
        count = COUNT(A);
        avg = AVG(A.cpu_used);
        generate FLATTEN(grp), count, avg;
    };
    f_data = limit f_data 10;
    dump f_data;

The code works with group and COUNT, but when I include AVG or SUM, it fails with this error:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias f_data

I checked the data types and they all look fine. Do you have any suggestions about what I missed? Thank you in advance for your help.

Aarav

2 Answers


It's a syntax error. Read http://chimera.labs.oreilly.com/books/1234000001811/ch06.html#more_on_foreach (section: Nested foreach) for details.

Pig Script

    A = LOAD 'a.csv' USING PigStorage(',') AS (user:chararray, cpu_used:float);
    B = GROUP A BY user;
    C = FOREACH B {
        cpu_used_bag = A.cpu_used;
        GENERATE group AS user, AVG(cpu_used_bag) AS avg_cpu_used, SUM(cpu_used_bag) AS total_cpu_used;
    };

Input : a.csv

a,3
a,4
b,5

Output :

(a,3.5,7.0)
(b,5.0,5.0)
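
Applied to the data in the question, the same pattern might look like the sketch below. It assumes, as in the question, that field $6 is the user and $52 is cpu_used. The explicit casts and the names job_count and top10 are my own additions for illustration, not something taken from the original script.

```pig
-- Sketch only: field positions are taken from the question; casts and new alias names are assumptions.
t_data = load 'log_flies/*' using PigStorage(',');

-- Project just the two fields needed, casting them explicitly since the load declares no schema.
A = foreach t_data generate (chararray)$6 as user, (float)$52 as cpu_used;

B = group A by user;
f_data = foreach B {
    -- Project the cpu_used column of each group's bag once, then aggregate it.
    cpu_used_bag = A.cpu_used;
    generate group as user,
             COUNT(A) as job_count,
             AVG(cpu_used_bag) as avg_cpu_used,
             SUM(cpu_used_bag) as total_cpu_used;
};

-- Use a fresh alias for the limited relation.
top10 = limit f_data 10;
dump top10;
```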
Murali Rao

Your Pig script has several problems:

  • Avoid using the same alias on both sides of =.
  • Declare the schema in the load statement: using PigStorage(',') as (...), listing your field names and types appropriately.

For example, this line:

    A = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;

should be changed to:

    F = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;

Likewise, in

    f_data = limit f_data 10;

replace the f_data on the left-hand side with some other name.

Keep things simple. General rules for debugging a Pig script:

  • run in local mode
  • dump after every line

I wrote a sample Pig script that mimics yours (working):

    t_data = load './file' using PigStorage(',') as (jobid:int, cpu_used:float);

    C = foreach t_data generate jobid, cpu_used;
    B = group C by jobid;
    f_data = foreach B {
        count = COUNT(C);
        sum = SUM(C.cpu_used);
        avg = AVG(C.cpu_used);
        generate FLATTEN(group), count, sum, avg;
    };
    never_f_data = limit f_data 10;

    dump never_f_data;
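
The debugging advice above (run in local mode, inspect each step) can be sketched roughly as follows. It reuses the './file' input and aliases from the sample script; sample_t and sample_B are hypothetical alias names introduced only for this illustration.

```pig
-- Start the Grunt shell in local mode first (from the command line):  pig -x local

t_data = load './file' using PigStorage(',') as (jobid:int, cpu_used:float);
describe t_data;              -- print the schema to confirm the field types

-- Look at a few rows before building on this relation.
sample_t = limit t_data 5;
dump sample_t;

C = foreach t_data generate jobid, cpu_used;
B = group C by jobid;
describe B;                   -- shows the group key and the nested bag named C

-- Check the grouped data before adding COUNT, SUM, or AVG on top of it.
sample_B = limit B 5;
dump sample_B;
```

Checking the schema with describe before each aggregation makes it easier to see which statement first produces something unexpected, instead of only seeing the failure at the final dump.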
KrazyGautam