I'm using PySpark to do some processing of server logs, and I'm quite new to functional programming concepts. I have a lookup table that I'm using in my function to select from a number of options, like so:
user_agent_vals = {
    'CanvasAPI': 'api',
    'candroid': 'mobile_app_android',
    'iCanvas': 'mobile_app_ios',
    'CanvasKit': 'mobile_app_ios',
    'Windows NT': 'desktop',
    'MacBook': 'desktop',
    'iPhone': 'mobile',
    'iPod Touch': 'mobile',
    'iPad': 'mobile',
    'iOS': 'mobile',
    'CrOS': 'desktop',
    'Android': 'mobile',
    'Linux': 'desktop',
    'Mac OS': 'desktop',
    'Macintosh': 'desktop'
}
def parse_requests(line):
    """
    Expects an input list, which is then mapped to the correct fieldnames in
    a dict.

    :param line: A list of values.
    :return: A list containing the values for writing to a file.
    """
    values = dict(zip(requests_fieldnames, line))
    print(values)
    values['request_timestamp'] = values['request_timestamp'].split('-')[1]
    found = False
    for key, value in user_agent_vals.items():
        if key in values['user_agent']:
            found = True
            values['user_agent'] = value
    if not found:
        values['user_agent'] = 'other_unknown'
    return [
        values['user_id'],
        values['context_id'],
        values['request_timestamp'],
        values['user_agent']
    ]
I don't want to redefine the dictionary every time I call the function (which will be millions of times), but it seems somehow 'dirty' to just rely on Python's LEGB lookup to find the dictionary in the module namespace. Should I pass the dictionary in as an argument (and if so, how?) to the map function that calls parse_requests, or what would be the best-practice way to handle this?
For reference, here is my map call:
parsed_data = course_data.map(parse_requests)
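
To make the question concrete, here is a rough sketch of what I imagine passing the dictionary explicitly could look like. This assumes parse_requests were changed to take user_agent_vals as a second parameter, and that sc is the active SparkContext for the broadcast variant:

from functools import partial

# Option 1: close over the dict with a lambda; Spark serializes the
# closure (including the dict) and ships it to the executors.
parsed_data = course_data.map(
    lambda line: parse_requests(line, user_agent_vals))

# Option 2: bind the dict ahead of time with functools.partial, so the
# mapped function still takes a single element.
parsed_data = course_data.map(
    partial(parse_requests, user_agent_vals=user_agent_vals))

# Option 3: for a larger lookup table, a broadcast variable sends the
# data to each worker once; the function reads it via .value.
ua_broadcast = sc.broadcast(user_agent_vals)
parsed_data = course_data.map(
    lambda line: parse_requests(line, ua_broadcast.value))

Is one of these preferable, or is leaving the dictionary at module level actually fine?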