6

I am using datetime in some Python UDFs that I use in my Pig script. So far so good. I use Pig 0.12.0 on Cloudera 5.5.

However, I also need to use the pytz or dateutil packages, and they don't seem to be part of a vanilla Python install.

Can I use them in my Pig UDFs in some way? If so, how? I think dateutil is installed on my nodes (I am not an admin, so how can I actually check that this is the case?), but when I type:

import sys
#I append the path to dateutil on my local windows machine. Is that correct?
sys.path.append('C:/Users/me/AppData/Local/Continuum/Anaconda2/lib/site-packages')

from dateutil import tz

in my udfs.py script, I get:

2016-08-30 09:56:06,572 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1121: Python Error. Traceback (most recent call last):
  File "udfs.py", line 23, in <module>
    from dateutil import tz
ImportError: No module named dateutil

when I run my pig script.

All my other python udfs (using datetime for instance) work just fine. Any idea how to fix that?

Many thanks!

UPDATE

After playing a bit with the Python path, I am now able to

import dateutil 

(at least Pig does not crash). But if I try:

from dateutil import tz

I get an error.

  from dateutil import tz 
  File "/opt/python/lib/python2.7/site-packages/dateutil/tz.py", line 16, in <module>
    from six import string_types, PY3
  File "/opt/python/lib/python2.7/site-packages/six.py", line 604, in <module>
    viewkeys = operator.methodcaller("viewkeys")
AttributeError: type object 'org.python.modules.operator' has no attribute 'methodcaller'

How to overcome that? I use tz in the following manner

to_zone = dateutil.tz.gettz('US/Eastern')
from_zone = dateutil.tz.gettz('UTC')

and then I change the timezone of my timestamps. Can I just import dateutil to do that? What is the proper syntax?

UPDATE 2

Following yakuza's suggestion, I am able to

import sys
sys.path.append('/opt/python/lib/python2.7/site-packages')
sys.path.append('/opt/python/lib/python2.7/site-packages/pytz/zoneinfo')

import pytz

but now I get an error again

Caused by: Traceback (most recent call last):
  File "udfs.py", line 158, in to_date_local
  File "__pyclasspath__/pytz/__init__.py", line 180, in timezone
pytz.exceptions.UnknownTimeZoneError: 'America/New_York'

when I define

to_zone = pytz.timezone('America/New_York')
from_zone = pytz.timezone('UTC')

Found some hints here: "UnknownTimezoneError Exception Raised with Python Application Compiled with Py2Exe"

What to do? Awww, I just want to convert timezones in Pig :(

ℕʘʘḆḽḘ
  • 1
    Regarding your second update, https://git.launchpad.net/pytz/tree/src/pytz/__init__.py#n180 suggests that 'America/New_York' is not in `all_timezones_set`. From source code it seems that this exception is either thrown if timezone is not composed of ASCII characters, or is not in known timezones list. Verify if your installation is not corrupted and that this entry is actually located in `pytz/__init__.py` file. – Yakuza Sep 01 '16 at 20:44
  • I am trying right now with `US/Eastern`. That should work, right? – ℕʘʘḆḽḘ Sep 01 '16 at 20:50
  • 1
    Well I don't believe your issue lies in which timezone you pick, both of them should be available out of the box in `pytz` - so yes, that should work. – Yakuza Sep 01 '16 at 21:49

2 Answers

4

Well, as you probably know, Python UDFs are not executed by the regular Python interpreter but by the Jython that is distributed with Pig. By default in 0.12.0 it should be Jython 2.5.3. Unfortunately, the six package, which dateutil requires, only supports Python 2.6 and later. pytz, however, does not seem to have such a dependency and should support Python 2.4 and later.

So to achieve your goal you should distribute a pytz package compatible with Python 2.5 to all your nodes, and add its path to sys.path in your Pig UDF. If you complete the same steps you did for dateutil, everything should work as you expect. We are using the very same approach with pygeoip and it works like a charm.
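To confirm which interpreter actually runs a UDF, a small check like the following could be placed in the UDF file. This is only a sketch: `describe_interpreter` is a hypothetical helper name, and the test for Jython assumes that Jython reports a `sys.platform` value starting with `java`.

```python
import operator
import sys


def describe_interpreter():
    """Return a few facts about the interpreter executing this file."""
    return {
        # For example (2, 5, 3) under the Jython bundled with Pig 0.12.0
        'version': tuple(sys.version_info[:3]),
        # Assumption: Jython reports a platform string beginning with 'java'
        'is_jython': sys.platform.startswith('java'),
        # six relies on operator.methodcaller, which was added in Python 2.6
        'has_methodcaller': hasattr(operator, 'methodcaller'),
    }


info = describe_interpreter()
```

Raising `RuntimeError(str(info))` from inside a UDF is one way to surface these values in the Pig error output. Under Jython 2.5.3 one would expect `has_methodcaller` to be false, which would match the AttributeError raised by six in the question.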

How does it work

When you run a Pig script that references a Python UDF (more precisely, a Jython UDF), your script is compiled to a map/reduce job, all REGISTERed files are included in the JAR file, and the JAR is distributed to the nodes where the code is actually executed. When your code runs, the Jython interpreter is started from Java code, so on each node taking part in the computation all Python imports are resolved locally on that node.

Imports from the standard library are taken from the Jython implementation, but all other packages have to be installed separately, as there is no pip for Jython. So to make external packages available to a Python UDF, you have to install the required packages manually, either with the pip of another Python installation or from source, and remember to download a version compatible with Python 2.5! Then, in every single UDF file, you have to append to sys.path the site-packages directory where you installed the packages on each node (it's important to use the same directory on each node). For example:

import sys
sys.path.append('/path/to/site-packages')
# Imports of non-stdlib packages
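If the packages live in different directories on different nodes, the same idea can be extended to a list of candidate directories. The following is a minimal sketch, not a tested Pig recipe: `add_existing_paths` is a hypothetical helper, and the directories listed are only examples taken from this thread that you would replace with the ones on your own nodes.

```python
import os
import sys

# Example candidate locations; replace them with the directories that
# actually exist on your nodes.
CANDIDATE_DIRS = [
    '/opt/python/lib/python2.7/site-packages',
    '/usr/lib/python2.6/site-packages',
    '/usr/local/lib/python2.7/dist-packages',
]


def add_existing_paths(candidates, path_list=None):
    """Append each existing, not yet listed directory; return those added."""
    if path_list is None:
        path_list = sys.path
    added = []
    for directory in candidates:
        if os.path.isdir(directory) and directory not in path_list:
            path_list.append(directory)
            added.append(directory)
    return added


added_dirs = add_existing_paths(CANDIDATE_DIRS)
# Imports of non-stdlib packages would follow here
```

Skipping directories that do not exist keeps sys.path short on nodes where only one of the locations is present.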

Proof of concept

Let's assume we have the following files:

/opt/pytz_test/test_pytz.pig:

REGISTER '/opt/pytz_test/test_pytz_udf.py' using jython as test;

A = LOAD '/opt/pytz_test/test_pytz_data.csv' AS (timestamp:int);
B = FOREACH A GENERATE
    test.to_date_local(timestamp);

STORE B INTO '/tmp/test_pytz_output.csv' using PigStorage(',');

/opt/pytz_test/test_pytz_udf.py:

from datetime import datetime
import sys

sys.path.append('/usr/lib/python2.6/site-packages/')

import pytz

@outputSchema('date:chararray')
def to_date_local(unix_timestamp):
    """
    converts unix timestamp to a rounded date
    """
    to_zone = pytz.timezone('America/New_York')
    from_zone = pytz.timezone('UTC')

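    try:
        # Parentheses let the method chain span several lines
        as_datetime = (datetime.utcfromtimestamp(unix_timestamp)
                       .replace(tzinfo=from_zone)
                       .astimezone(to_zone)
                       .date()
                       .strftime('%Y-%m-%d'))
    except Exception:
        # Fall back to the raw input if the conversion fails
        as_datetime = unix_timestamp
    return as_datetime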

/opt/pytz_test/test_pytz_data.csv:

1294778181
1294778182
1294778183
1294778184

Now let's install pytz on our node. It has to be installed using a Python version for which the pytz release is still compatible with Python 2.5 (that is, 2.5-2.7); in my case I'll use Python 2.6:

sudo pip2.6 install pytz

Please make sure that the file /opt/pytz_test/test_pytz_udf.py adds to sys.path the site-packages directory where pytz is installed.

Now once we run Pig with our test script:

pig -x local /opt/pytz_test/test_pytz.pig

We should be able to read the output of our job, which should list:

2011-01-11
2011-01-11
2011-01-11
2011-01-11
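The expected dates can be sanity-checked without Pig or pytz. The sketch below deliberately swaps in a fixed UTC-5 offset (Eastern Standard Time) instead of a real time zone database, which is an assumption that only holds for these January timestamps; `to_est_date` is a hypothetical helper, not part of the UDF above.

```python
from datetime import datetime, timedelta

# Assumption: a fixed UTC-5 offset (EST). This ignores daylight saving
# time and is only valid for the January sample data.
EST_OFFSET = timedelta(hours=-5)
EPOCH = datetime(1970, 1, 1)

SAMPLE_TIMESTAMPS = [1294778181, 1294778182, 1294778183, 1294778184]


def to_est_date(unix_timestamp):
    """Convert a Unix timestamp to a 'YYYY-MM-DD' date string in EST."""
    utc_time = EPOCH + timedelta(seconds=unix_timestamp)
    return (utc_time + EST_OFFSET).strftime('%Y-%m-%d')


dates = [to_est_date(ts) for ts in SAMPLE_TIMESTAMPS]
```

The first timestamp corresponds to 2011-01-11 20:36:21 UTC, so subtracting five hours stays on the same day and all four entries come out as 2011-01-11.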
Yakuza
  • thanks but now I get `Caused by: Traceback (most recent call last): File "udfs.py", line 158, in to_date_local File "__pyclasspath__/pytz/__init__.py", line 180, in timezone pytz.exceptions.UnknownTimeZoneError: 'America/New_York'` – ℕʘʘḆḽḘ Sep 01 '16 at 11:53
  • after doing some research it appears this can be due to the fact that the timezones are stored into another folder than `sys.path.append('C:/Users/me/AppData/Local/Continuum/Anaconda2/lib/site-packages')`.. Any ideas what to do? – ℕʘʘḆḽḘ Sep 01 '16 at 11:54
  • 1
    Well I guess solution would be to place this information there (http://stackoverflow.com/questions/21717411/timezone-information-missing-in-pytz). By looking at pytz [source](https://git.launchpad.net/pytz/tree/src/pytz/__init__.py#n79), there is a method called `open_resource` having this comment: "Open a resource from the zoneinfo subdir for reading. Uses the pkg_resources module if available and no standard file found at the calculated location." So best solution would be to place database in one of these locations. – Yakuza Sep 01 '16 at 13:41
  • what do you mean? what should I do with `open_resource`? – ℕʘʘḆḽḘ Sep 01 '16 at 13:42
  • 1
    Sorry, pressed "enter" too quickly. Edited above. – Yakuza Sep 01 '16 at 13:43
  • sorry for my noobiness but can you tell me how could I do that? what would be the code to include in my udfs? – ℕʘʘḆḽḘ Sep 01 '16 at 13:44
  • 1
    Yes, sure. If you do clean install of `pytz` it creates directory structure in your local `dist-packages`. In my case it would be: `/usr/local/lib/python2.7/dist-packages/pytz`. Inside you will find folder named: `zoneinfo`. What you need to do is to make sure, that this folder is distributed on all nodes to where `pytz` is installed. Just like it should be after proper installation. – Yakuza Sep 01 '16 at 13:48
  • yes but this is the problem: all the nodes have a proper anaconda distribution so they all have pytz already... – ℕʘʘḆḽḘ Sep 01 '16 at 13:52
  • 1
    Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/122427/discussion-between-yakuza-and-noobie). – Yakuza Sep 01 '16 at 13:53
  • Thanks again yakuza for your kind help. Let me try that. But what do you mean exactly by `Please make sure, that file /opt/pytz_test/test_pytz_udf.py adds to sys.path reference to site-packages where pytz is installed.` ? – ℕʘʘḆḽḘ Sep 01 '16 at 20:55
  • oh oki I get it, just adding the path where `pytz` is installed. but that was already the case... :( – ℕʘʘḆḽḘ Sep 01 '16 at 20:57
  • 1
    So let's dig a bit deeper. To make sure everything in your file system is set up properly, log somewhere in your udf result of this code: `pytz.resource_exists('America/New_York')` Best way would be to dump it to file or raise RuntimeError with proper message. In fact you could also make use of following information: `str(pytz.all_timezones)` – Yakuza Sep 01 '16 at 21:44
  • Hi Yakuza. I think we re close. I have pytz installed in all my nodes. question is, the path in my udfs refers to which path? The one on the main computer or the ones on my nodes? Strangely enough, the line with `UTC` does work, the line with `America/New_York` or `US/Eastern` causes the `pytz.exceptions.UnknownTimeZoneError: 'US/Eastern'` – ℕʘʘḆḽḘ Sep 02 '16 at 16:09
  • 1
    As you can find in pytz source https://git.launchpad.net/pytz/tree/src/pytz/__init__.py#n89 if you provide time zone 'US/Eastern' it looks for file: `__file__/zoneinfo/US/Eastern` and file is path to your `pytz/__init__.py` file. So it should definitely look for the ones on your nodes, precisely on node where your Pig code is being executed. – Yakuza Sep 02 '16 at 19:36
  • 1
    Oh, and UTC is working because it's special case and is not being loaded from file system: https://git.launchpad.net/pytz/tree/src/pytz/__init__.py#n244 – Yakuza Sep 03 '16 at 09:12
  • great! I will try some things. Im pretty sure this question is gonna have a lot of views. So assuming that each package is installed in a different directory (in the nodes, in the master), then I should add **all of them** in the `udfs.py` file. Correct? – ℕʘʘḆḽḘ Sep 03 '16 at 15:50
  • 1
    Correct, if you have pytz installed in different directories on different nodes then you have to include all possible locations on all nodes, as you can't tell on which one code will be executed. – Yakuza Sep 03 '16 at 21:05
  • Hi @Yakuza, I guess its time for the very last question ;-) I have my `pytz` package installed, and I see the `egg` file. Should I add the path to the `egg`? – ℕʘʘḆḽḘ Sep 06 '16 at 11:58
  • Hey @Noobie :) no, it's not necessary, you simply have to include the path to all site-packages/dist-packages directories where `pytz` is installed on all nodes. The path where `pytz` looks for its files is computed relative to the directory in which the `pytz` package resides. – Yakuza Sep 06 '16 at 12:17
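The check discussed in the comments above (whether the zone files sit next to the package) can be sketched as follows. This is an illustration under assumptions: `zone_file_path` and `missing_zones` are hypothetical helpers, and the layout `<package dir>/zoneinfo/<zone name>` is taken from the description of the pytz source given in the comments.

```python
import os


def zone_file_path(package_dir, zone_name):
    """Build the path where the zone's data file is expected to live."""
    return os.path.join(package_dir, 'zoneinfo', *zone_name.split('/'))


def missing_zones(package_dir, zone_names):
    """Return the zone names whose data file is absent under package_dir."""
    return [name for name in zone_names
            if not os.path.isfile(zone_file_path(package_dir, name))]


# Inside a UDF file, package_dir would be the directory of the imported
# package, for example:
#     import pytz
#     package_dir = os.path.dirname(pytz.__file__)
#     problems = missing_zones(package_dir, ['US/Eastern', 'America/New_York'])
#     if problems:
#         raise RuntimeError('zone files missing: %s' % problems)
```

Note that the traceback in the question shows `__pyclasspath__/pytz/__init__.py`, which suggests the package may have been loaded from the job's classpath rather than from a plain directory. If so, the computed path would not exist on disk, which would be consistent with UTC working (it is not read from a file) while other zones fail.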
1

From the answer to a different but related question, it seems that you should be able to use resources as long as they are available on each of the nodes.

I think you can then add the path as described in this answer regarding jython, and load the modules as usual.

Append the location to the sys.path in the Python script:

import sys
sys.path.append('/usr/local/lib/python2.7/dist-packages')
import happybase
Dennis Jaheruddin
  • thanks but I still get `org.apache.pig.backend.executionengine.ExecException: ERROR 1121: Python Error. Traceback (most recent call last): File "udfs.py", line 18, in from dateutil import tz ImportError: No module named dateutil` – ℕʘʘḆḽḘ Aug 29 '16 at 16:12
  • and python-dateutil has been installed in all my nodes – ℕʘʘḆḽḘ Aug 29 '16 at 16:12
  • 1
    @Noobie Are you able to do the import when running a python script manually on the slave node?-- Perhaps you need to append the package location to the path, I have edited the answer. – Dennis Jaheruddin Aug 30 '16 at 06:56
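For the manual import test suggested in the last comment, a short script like this could be run on a node to see whether a module can be imported and where it is loaded from. It is a sketch only: `locate_module` is a hypothetical helper, and the result describes whichever interpreter runs the script, which is not necessarily the Jython that Pig uses.

```python
def locate_module(name):
    """Return the file a top-level module loads from, or None if import fails."""
    try:
        module = __import__(name)
    except ImportError:
        return None
    # Built-in modules have no __file__ attribute
    return getattr(module, '__file__', '(built-in)')


for module_name in ('dateutil', 'pytz', 'six'):
    print('%s -> %s' % (module_name, locate_module(module_name)))
```

A result of None for a module means that interpreter cannot import it with its current sys.path; a file path shows the directory that would need to be appended in the UDF file.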