
I wrote a simple UDF in Python for Pig:

@outputSchema("date: chararray")
def to_date2(dt):
  print dt
  a = dt.split("/")
  print a
  return "20" + a[2] + a[0] + a[1]

I am reading a CSV that has a column like "1/1/17", and I am trying to convert it to 201711 using this UDF.

I get this error:

Caused by: Traceback (most recent call last): 
   File "/tmp/pig715837049480092569tmp/util2.py", line 20, in to_date2 
 IndexError: index out of range: 2 

but I can't see any of my print statements in the job run log.

If I am writing a Python UDF, how do I print things into the Hadoop log, so that I can see the value of dt that was passed to my function?

Knows Not Much
  • You would look at the YARN (or Tez) UI for your job output. The real solution is to put a try/except in your code, though – OneCricketeer May 29 '18 at 12:19
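
A minimal sketch of that suggestion, reusing the same UDF (writes to stderr should show up in the task's container logs, which is where the YARN UI points you):

import sys

@outputSchema("date: chararray")
def to_date2(dt):
  try:
    a = dt.split("/")
    return "20" + a[2] + a[0] + a[1]
  except IndexError:
    # the offending row lands in the task's stderr log instead of disappearing
    sys.stderr.write("to_date2: unexpected input %r\n" % (dt,))
    return None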

1 Answer


I feel like you should use strptime to parse dates, not split.

from datetime import datetime

# strptime is a classmethod of datetime, not date
dt = datetime.strptime(dt, '%m/%d/%y')  # month first, matching a[0] in the original UDF
return dt.strftime('%Y%m%d')
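
Wired into the UDF, that might look like this (a sketch; a malformed row now raises a ValueError naming the bad input, which you could catch and log the same way as in the try/except sketch above):

from datetime import datetime

@outputSchema("date: chararray")
def to_date2(dt):
  # strptime both parses and validates; strftime keeps the leading zeros
  return datetime.strptime(dt, '%m/%d/%y').strftime('%Y%m%d')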

If you want to remove the leading zeros (which you should keep, or at least keep some delimiter to avoid 201711 being parsed as Nov 2017), see this answer
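
For instance, plugging two distinct dates into the original split-based UDF produces the same string:

to_date2('11/1/17')  # -> '2017111'
to_date2('1/11/17')  # -> '2017111'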

OneCricketeer
  • Yes, that's fine, but how do I print into the Hadoop logs? – Knows Not Much May 29 '18 at 14:38
  • As mentioned in the comments above, it's in the YARN UI, port 8088 of the ResourceManager. You could also use Pig to dump your CSV column as-is in order to debug it. Pig also has string-splitting and date-parsing functions – OneCricketeer May 30 '18 at 01:39