I need to find the fastest way to tokenize a signal. The signal is of the form:
identifier:value identifier:value identifier:value ...
An identifier consists only of alphanumerics and underscores, and is separated from the previous value by a space. A value may contain alphanumerics, various braces/brackets, and spaces.
e.g.
signal_id:debug_word12_ind data:{ } virtual_interface_index:0x0000 module_id:0x0001 module_sub_id:0x0016 timestamp:0xcc557366 debug_words:[0x0006 0x0006 0x0000 0x0000 0x0000 0x0000 0xcc55 0x70a9 0x4c55 0x7364 0x0000 0x0000] sequence_number:0x0174
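With the parsing rules above, that example should end up as a dict along these lines (all values kept as strings):

{'signal_id': 'debug_word12_ind',
 'data': '{ }',
 'virtual_interface_index': '0x0000',
 'module_id': '0x0001',
 'module_sub_id': '0x0016',
 'timestamp': '0xcc557366',
 'debug_words': '[0x0006 0x0006 0x0000 0x0000 0x0000 0x0000 0xcc55 0x70a9 0x4c55 0x7364 0x0000 0x0000]',
 'sequence_number': '0x0174'}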
The best I've come up with is below. Ideally I'd like to halve the time it takes. I've tried various things with regexes but they're no better. Any suggestions?
# Convert data to a dictionary. Expect data to be something like
# parameter_1:a b c d parameter_2:false parameter_3:0xabcd parameter_4:-56
# Split at colons. First part will be just a parameter name, last will be just a value;
# everything in between will be <value><space><next parameter name>
parts1 = data.split(":")
parts2 = []
for i, part in enumerate(parts1):
    # Copy first and last 'as is'
    if i == 0 or i == len(parts1) - 1:
        parts2.append(part)
    # Split everything in between at the last space (don't expect parameter names to contain spaces)
    else:
        parts2.extend(part.rsplit(' ', 1))
# Expect to now have [parameter name, value, parameter name, value, ...]. Convert to a dict
self.data_dict = {}
for i in range(0, len(parts2), 2):
    self.data_dict[parts2[i]] = parts2[i + 1]
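For reference, the regex attempts were along these lines (a rough reconstruction rather than the exact pattern I benchmarked): a key is a run of word characters followed by a colon, and the value is everything up to the next key or the end of the string. It relies on values never containing colons.

import re

# Key = word characters before a colon; value = lazily matched text
# up to the next " key:" or the end of the string.
PAIR_RE = re.compile(r'(\w+):(.*?)(?=\s+\w+:|$)')

def parse(data):
    # findall returns (key, value) tuples, which dict() accepts directly
    return dict(PAIR_RE.findall(data))

On the example above this produces the same dict as the split-based code, but it was no faster for me.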