0

I know this is one of the most repeated question. I have looked almost everywhere and none of the resources could resolve the issue I am facing. Below is the simplified version of my problem statement. But in actual data is little complex so I have to use UDF

My input File: (input.txt)

NotNeeded1,NotNeeded11;Needed1
NotNeeded2,NotNeeded22;Needed2

I want the output to be

Needed1
Needed2

So, I am writing the below UDF (Java code):

package com.company.pig;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class myudf extends EvalFunc<String>{
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        String s = (String)input.get(0);
        String str = s.split("\\,")[1];
        String str1 = str.split("\\;")[1];
        return str1;
    }
}

And packaging it into

rollupreg_extract-jar-with-dependencies.jar

Below is my pig shell code

grunt> REGISTER /pig/rollupreg_extract-jar-with-dependencies.jar;
grunt> DEFINE myudf com.company.pig.myudf;
grunt> data = LOAD 'hdfs://sandbox.hortonworks.com:8020/pig_hdfs/input.txt' USING PigStorage(',');
grunt> extract = FOREACH data GENERATE myudf($1);
grunt> DUMP extract;

And I get the below error:

2017-05-15 15:58:15,493 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2017-05-15 15:58:15,577 [main] INFO  org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2017-05-15 15:58:15,659 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
2017-05-15 15:58:15,774 [main] INFO  org.apache.pig.impl.util.SpillableMemoryManager - Selected heap (PS Old Gen) of size 699400192 to monitor. collectionUsageThreshold = 489580128, usageThreshold = 489580128
2017-05-15 15:58:15,865 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2017-05-15 15:58:15,923 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2017-05-15 15:58:15,923 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2017-05-15 15:58:16,184 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
2017-05-15 15:58:16,196 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/172.17.0.2:8050
2017-05-15 15:58:16,396 [main] INFO  org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at sandbox.hortonworks.com/172.17.0.2:10200
2017-05-15 15:58:16,576 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
2017-05-15 15:58:16,580 [main] WARN  org.apache.pig.tools.pigstats.ScriptState - unable to read pigs manifest file
2017-05-15 15:58:16,584 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2017-05-15 15:58:16,588 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process
2017-05-15 15:58:17,258 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/pig/rollupreg_extract-jar-with-dependencies.jar to DistributedCache through /tmp/temp-1119775568/tmp-858482998/rollupreg_extract-jar-with-dependencies.jar
2017-05-15 15:58:17,276 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2017-05-15 15:58:17,294 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2017-05-15 15:58:17,295 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
2017-05-15 15:58:17,295 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
2017-05-15 15:58:17,354 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2017-05-15 15:58:17,510 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
2017-05-15 15:58:17,511 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/172.17.0.2:8050
2017-05-15 15:58:17,511 [JobControl] INFO  org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at sandbox.hortonworks.com/172.17.0.2:10200
2017-05-15 15:58:17,753 [JobControl] WARN  org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2017-05-15 15:58:17,820 [JobControl] INFO  org.apache.pig.builtin.PigStorage - Using PigTextInputFormat
2017-05-15 15:58:17,830 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2017-05-15 15:58:17,830 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2017-05-15 15:58:17,884 [JobControl] INFO  com.hadoop.compression.lzo.GPLNativeCodeLoader - Loaded native gpl library
2017-05-15 15:58:17,889 [JobControl] INFO  com.hadoop.compression.lzo.LzoCodec - Successfully loaded & initialized native-lzo library [hadoop-lzo rev 7a4b57bedce694048432dd5bf5b90a6c8ccdba80]
2017-05-15 15:58:17,922 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2017-05-15 15:58:18,525 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2017-05-15 15:58:18,692 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1494853652295_0023
2017-05-15 15:58:18,879 [JobControl] INFO  org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
2017-05-15 15:58:18,973 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1494853652295_0023
2017-05-15 15:58:19,029 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1494853652295_0023/
2017-05-15 15:58:19,030 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1494853652295_0023
2017-05-15 15:58:19,030 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases data,extract
2017-05-15 15:58:19,030 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: data[2,7],extract[3,10] C:  R:
2017-05-15 15:58:19,044 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2017-05-15 15:58:19,044 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1494853652295_0023]
2017-05-15 15:58:29,156 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2017-05-15 15:58:29,156 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_1494853652295_0023 has failed! Stop running all dependent jobs
2017-05-15 15:58:29,157 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2017-05-15 15:58:29,790 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
2017-05-15 15:58:29,791 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/172.17.0.2:8050
2017-05-15 15:58:29,793 [main] INFO  org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at sandbox.hortonworks.com/172.17.0.2:10200
2017-05-15 15:58:30,311 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
2017-05-15 15:58:30,312 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/172.17.0.2:8050
2017-05-15 15:58:30,313 [main] INFO  org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at sandbox.hortonworks.com/172.17.0.2:10200
2017-05-15 15:58:30,465 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!
2017-05-15 15:58:30,467 [main] WARN  org.apache.pig.tools.pigstats.ScriptState - unable to read pigs manifest file
2017-05-15 15:58:30,472 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:

HadoopVersion   PigVersion      UserId  StartedAt       FinishedAt      Features
2.7.3.2.5.0.0-1245              root    2017-05-15 15:58:16     2017-05-15 15:58:30     UNKNOWN

Failed!

Failed Jobs:
JobId   Alias   Feature Message Outputs
job_1494853652295_0023  data,extract    MAP_ONLY        Message: Job failed!    hdfs://sandbox.hortonworks.com:8020/tmp/temp-1119775568/tmp-1619300225,

Input(s):
Failed to read data from "/pig_hdfs/input.txt"

Output(s):
Failed to produce result in "hdfs://sandbox.hortonworks.com:8020/tmp/temp-1119775568/tmp-1619300225"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1494853652295_0023


2017-05-15 15:58:30,472 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2017-05-15 15:58:30,499 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias extract
Details at logfile: /pig/pig_1494863836458.log

I know it complaints that

Failed to read data from "/pig_hdfs/input.txt"

But I am sure this is not the actual issue. If I don't use the udf and directly dump the data, I get the output. So, this is not the issue.

Rahul
  • 319
  • 2
  • 7
  • 15

1 Answers1

1

First, you do not need an udf to get the desired output.You can use semi colon as the delimiter in load statement and get the needed column.

data = LOAD 'hdfs://sandbox.hortonworks.com:8020/pig_hdfs/input.txt' USING PigStorage(';');
extract = FOREACH data GENERATE $1;
DUMP extract;

If you insist on using udf then you will have to load the record into a single field and then use the udf.Also,your udf is incorrect.You should split the string s with ';' as the delimiter, which is passed from the pig script.

String s = (String)input.get(0);
String str1 = s.split("\\;")[1];

And in your pig script,you need to load the entire record into 1 field and use the udf on field $0.

REGISTER /pig/rollupreg_extract-jar-with-dependencies.jar;
DEFINE myudf com.company.pig.myudf;
data = LOAD 'hdfs://sandbox.hortonworks.com:8020/pig_hdfs/input.txt' AS (f1:chararray);
extract = FOREACH data GENERATE myudf($0);
DUMP extract;
nobody
  • 10,892
  • 8
  • 45
  • 63
  • I did it. But pig is giving the same issue. If I just `REGISTER ` the jar and then `DUMP data` after loading it - it gives the same issue in spite of not using the UDF – Rahul May 16 '17 at 07:56
  • You have 2 columns and if you use ; then don't you think PigStorage(';') will split it into 2 columns? PIG is not the issue here.Please check your input and script.Also if you use UDF did you change your java udf and pigscript?What is the point of registering the udf if you are not using it? – nobody May 16 '17 at 13:59
  • My data was complex. Just for keeping it simple I made the sample data for question. Without `REGISTER ` when I issued the `DUMP data` it was loading the data properly. The moment I registered the jar all my `DUMP` and `STORE` commands were failing. I found the solution.. It was failing because of UDF jar. It was because of `org.apache.pig` v0.12 dependency being incompatible. I tried with other versions of the same dependency but it didn't work. Finally dependency in [this](http://stackoverflow.com/a/21396370/2716962) answer solved the issue – Rahul May 16 '17 at 14:24