
I wrote a simple UDF in Python for Pig:

@outputSchema("date: chararray")
def to_date2(dt):
  print dt
  a = dt.split("/")
  print a
  return "20" + a[2] + a[0] + a[1]

I am reading a CSV that has a column like "1/1/17", and I am trying to convert it to 201711 using this UDF.

I get this error:

Caused by: Traceback (most recent call last): 
   File "/tmp/pig715837049480092569tmp/util2.py", line 20, in to_date2 
 IndexError: index out of range: 2 

but I can't see any of my print statements in the job run log.

If I am writing a Python UDF, how do I print things into the Hadoop log, so that I can see the value of dt that was passed to my function?

Knows Not Much
  • You would look at the YARN (or Tez) UI for your job output. The real solution is to put a try/except in your code, though – OneCricketeer May 29 '18 at 12:19
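
A minimal sketch of that suggestion, reusing the same UDF (writes to stderr should show up in the task's container logs, which is where the YARN UI points you):

import sys

@outputSchema("date: chararray")
def to_date2(dt):
  try:
    a = dt.split("/")
    return "20" + a[2] + a[0] + a[1]
  except IndexError:
    # the offending row lands in the task's stderr log instead of disappearing
    sys.stderr.write("to_date2: unexpected input %r\n" % (dt,))
    return None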

1 Answer


I feel like you should use strptime to parse dates, not split.

from datetime import datetime

# strptime is a classmethod of datetime, not date
dt = datetime.strptime(dt, '%m/%d/%y')  # month first, matching a[0] in the original UDF
return dt.strftime('%Y%m%d')
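
Wired into the UDF, that might look like this (a sketch; a malformed row now raises a ValueError naming the bad input, which you could catch and log the same way as in the try/except sketch above):

from datetime import datetime

@outputSchema("date: chararray")
def to_date2(dt):
  # strptime both parses and validates; strftime keeps the leading zeros
  return datetime.strptime(dt, '%m/%d/%y').strftime('%Y%m%d')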

If you want to remove the leading zeros (which you should keep, or at least keep some delimiter to avoid 201711 being parsed as Nov 2017), see this answer
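
For instance, plugging two distinct dates into the original split-based UDF produces the same string:

to_date2('11/1/17')  # -> '2017111'
to_date2('1/11/17')  # -> '2017111'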

OneCricketeer
  • Yes, that's fine, but how do I print into the Hadoop logs? – Knows Not Much May 29 '18 at 14:38
  • As mentioned in the comments above, it's in the YARN UI, port 8088 of the ResourceManager. You could also use Pig to dump your CSV column as-is in order to debug it. Pig also has string-splitting and date-parsing functions – OneCricketeer May 30 '18 at 01:39