
I need to get the average age in each gender group...

Here is my data set:

01::F::21::0001
02::M::31::21345
03::F::22::33323
04::F::18::123
05::M::31::14567

Basically this is

userid::gender::age::occupationid

Since the delimiter is more than one character long (::), I followed advice I found elsewhere on Stack Overflow and loaded each line as a single string with TextLoader() first:

loadUsers  = LOAD '/user/cloudera/test/input/users.dat' USING TextLoader() as (line:chararray); 

testusers = FOREACH loadUsers GENERATE FLATTEN(STRSPLIT(line,'::')) as (user:int, gender:chararray, age:int, occupation:int);

grunt> DESCRIBE testusers;
testusers: {user: int,gender: chararray,age: int,occupation: int}

grouped_testusers = GROUP testusers BY gender;
average_age_of_testusers = FOREACH grouped_testusers GENERATE group, AVG(testusers.age);

After running

dump average_age_of_testusers;

I get this error:

2016-10-31 13:39:22,175 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 0: Exception while executing (Name: grouped_testusers: Local Rearrange[tuple]{chararray}(false) - scope-284 Operator Key: scope-284): org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial
2016-10-31 13:39:22,175 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!

Input(s):
Failed to read data from "/user/cloudera/test/input/users.dat"

Output(s):
Failed to produce result in "hdfs://quickstart.cloudera:8020/tmp/temp-169204712/tmp-1755697117"

This is my first attempt at programming in Pig, so forgive me if the solution is very obvious.

Looking at it further, the job seems to have trouble computing the average. I thought I had made a mistake with the data type, but DESCRIBE shows age as int.

Any help is appreciated. Thank you.

bencampbell_14
  • The error mentions that the data could not be read from the input file: Failed to read data from "/user/cloudera/test/input/users.dat". After the LOAD statement, try DUMP loadUsers; to make sure the data is actually being loaded. – nobody Oct 31 '16 at 21:19
  • @inquisitive_mind I dumped loadUsers and it works. I dumped testusers and that works too. I'm a bit stuck here, though. – bencampbell_14 Nov 01 '16 at 08:26

1 Answer


I figured out the problem. Please refer to the question "How can correct data types on Apache Pig be enforced?" for a better explanation.

To show what I did: I had to cast the tuple returned by STRSPLIT to the types I wanted.

testusers = FOREACH loadUsers GENERATE FLATTEN((tuple(int,chararray,int,int)) STRSPLIT(line,'::')) as (user:int, gender:chararray, age:int, occupation:int);

AVG was failing because testusers.age was still being treated as a string instead of an int, even though DESCRIBE reported it as int.
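
For reference, here is the whole pipeline from the question with the cast applied, as a minimal sketch. It assumes the same '::'-delimited file and path as above; the output field names gender and average_age are only illustrative and were not in my original script.

```pig
-- Sketch only: assumes the '::'-delimited users.dat file from the question.
-- Read each record as one chararray, because the delimiter is two characters long.
loadUsers = LOAD '/user/cloudera/test/input/users.dat' USING TextLoader() AS (line:chararray);

-- Cast the tuple returned by STRSPLIT so the values are converted to int/chararray;
-- the AS clause then names the flattened fields.
testusers = FOREACH loadUsers GENERATE
    FLATTEN((tuple(int,chararray,int,int)) STRSPLIT(line, '::'))
    AS (user:int, gender:chararray, age:int, occupation:int);

-- One group per gender value.
grouped_testusers = GROUP testusers BY gender;

-- Average the age column inside each group (output names are illustrative).
average_age_of_testusers = FOREACH grouped_testusers
    GENERATE group AS gender, AVG(testusers.age) AS average_age;

DUMP average_age_of_testusers;
```

With the five sample rows above, this should give roughly 20.33 for F, that is (21 + 22 + 18) / 3, and 31.0 for M.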

bencampbell_14
  • I am sure this works, but I wonder why. After all, your DESCRIBE of testusers shows age as an int. Was the field empty, perhaps? It is obviously not a char at the moment it goes into AVG. – Dennis Jaheruddin Nov 01 '16 at 10:07
  • @DennisJaheruddin Yes, I know what you mean. That is why I felt helpless at first: I did not know what I was missing. I deleted and re-copied the file in HDFS, made sure it had the right permissions, and so on. Dumping loadUsers and testusers both worked; the error only appeared once I applied AVG. I also tried loading the same data set without the FOREACH ... GENERATE FLATTEN(STRSPLIT(...)) step, using ':' as the delimiter and declaring the empty field produced by the extra ':' as a bytearray (since the real delimiter is '::'). With that load, the same AVG worked. – bencampbell_14 Nov 01 '16 at 10:13