1

I'm using pyspark to do some processing of server logs, and I'm quite new to functional programming concepts. I have a lookup table that I'm using in my function to select from a number of options like so:

user_agent_vals = {
        'CanvasAPI': 'api',
        'candroid': 'mobile_app_android',
        'iCanvas': 'mobile_app_ios',
        'CanvasKit': 'mobile_app_ios',
        'Windows NT': 'desktop',
        'MacBook': 'desktop',
        'iPhone': 'mobile',
        'iPod Touch': 'mobile',
        'iPad': 'mobile',
        'iOS': 'mobile',
        'CrOS': 'desktop',
        'Android': 'mobile',
        'Linux': 'desktop',
        'Mac OS': 'desktop',
        'Macintosh': 'desktop'
    }

def parse_requests(line):
    """
    Expects an input list, which is then mapped to the correct fieldnames in
    a dict.

    :param line: A list of values.
    :return: A list containing the values for writing to a file.
    """
    values = dict(zip(requests_fieldnames, line))
    print(values)
    values['request_timestamp'] = values['request_timestamp'].split('-')[1]
    found = False
    for key, value in user_agent_vals.items():
        if key in values['user_agent']:
            found = True
            values['user_agent'] = value
    if not found:
        values['user_agent'] = 'other_unknown'
    return [
        values['user_id'],
        values['context_id'],
        values['request_timestamp'],
        values['user_agent']
    ]

I don't want to re-define the dictionary every time I call the function (which will be millions of times), but it seems somehow 'dirty' to just use Python's LEGB lookup to let it find the dictionary in the module namespace. Should I pass in an argument (and if so, how?) to the map function that calls parse_requests, or what would be the best practice way to handle this?

For reference, here is my map call:

parsed_data = course_data.map(parse_requests)
Mike Müller
  • 82,630
  • 20
  • 166
  • 161
flybonzai
  • 3,763
  • 11
  • 38
  • 72

2 Answers2

1

It is a convention to use all upper case for such global "constants":

USER_AGENT_VALS

For example, the default settings of pylint only allow all upper case names for variables (other than functions and classes) on the module level.

Alternately, you can supply user_agent_vals as second argument:

def parse_requests(line, user_agent_vals):

Call with:

parse_requests(line, user_agent_vals)

You can "freeze" an argument to a function with functools.partial():

from functools import partial

parse_requests_for_map = partial(parse_requests, user_agent_vals=user_agent_vals)

Now, you can use it with map:

parsed_data = course_data.map(parse_requests_for_map)
Mike Müller
  • 82,630
  • 20
  • 166
  • 161
1

Put all the stuff you need in an object and make the object a "callable" (by defining some def __call__(self, arg): method), and pass the object as the function for map to use.

Nice example here (for multiprocessing's map, but the technique is more generally applicable).

Community
  • 1
  • 1
timday
  • 24,582
  • 12
  • 83
  • 135