
I need to find the fastest way to tokenize a signal. The signal is of the form:

identifier:value identifier:value identifier:value ...

An identifier consists only of alphanumerics and underscores, and is separated from the previous value by a space. A value may contain alphanumerics, various braces/brackets, and spaces.

e.g. signal_id:debug_word12_ind data:{ } virtual_interface_index:0x0000 module_id:0x0001 module_sub_id:0x0016 timestamp:0xcc557366 debug_words:[0x0006 0x0006 0x0000 0x0000 0x0000 0x0000 0xcc55 0x70a9 0x4c55 0x7364 0x0000 0x0000] sequence_number:0x0174
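
For reference, the tokenizer should turn that example into a dict of strings along these lines (only the first few pairs shown):

{
    'signal_id': 'debug_word12_ind',
    'data': '{ }',
    'virtual_interface_index': '0x0000',
    'module_id': '0x0001',
    ...
}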

The best I've come up with is below. Ideally I'd like to halve the time it takes. I've tried various things with regexes but they're no better. Any suggestions?
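
For the record, a regex version would look something like the sketch below, which finds identifier:value pairs by using a lookahead for the next identifier (the pattern and function name are just illustrative); variants along these lines were no quicker for me:

import re

# Each pair is "<identifier>:<value>"; the value runs until the next
# " identifier:" or the end of the string.
PAIR_RE = re.compile(r'(\w+):(.*?)(?=\s\w+:|$)')

def tokenize_regex(data):
    # findall returns (identifier, value) tuples, which dict() pairs up
    return dict(PAIR_RE.findall(data))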

# Convert data to dictionary. Expect data to be something like
# parameter_1:a b c d parameter_2:false parameter_3:0xabcd parameter_4:-56

# Split at colons. First part will be just parameter name, last will be just value
# everything in between will be <parameter name><space><value>
parts1 = data.split(":")
parts2 = []
for part in parts1:
    # Copy first and last 'as is'
    if part in (parts1[0], parts1[-1]):
        parts2.append(part)
    # Split everything in between at last space (don't expect parameter names to contain spaces)
    else:
        parts2.extend(part.rsplit(' ', 1))

# Expect to now have [parameter name, value, parameter name, value, ...]. Convert to a dict
self.data_dict = {}
for i in range(0, len(parts2), 2):
    self.data_dict[parts2[i]] = parts2[i + 1]
  • have you considered PyParsing? But if the format is simple enough, probably just rewriting your code in Cython would work well. – norok2 Oct 04 '19 at 16:43

1 Answer

I have optimized your solution a little:

1) Removed the check from the loop.

2) Changed the dictionary creation: the key/value pairs are now built from the single flat list with zip.

parts1 = data.split(":")

# First part is just the first parameter name
parts2 = [parts1.pop(0)]

# Every middle part is "<value> <next parameter name>"; split it at the last space
for part in parts1[:-1]:
    parts2.extend(part.rsplit(' ', 1))

# Last part is just the final value
parts2.append(parts1.pop())

# Pair names with values from the flat list
data_dict = {k: v for k, v in zip(parts2[::2], parts2[1::2])}
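
If you want a more controlled comparison than timing the whole script, something like the self-contained timeit sketch below works; the function names, test string and repeat count are just illustrative:

import timeit

def tokenize_original(data):
    # The questioner's version: copy the first and last parts as-is,
    # split everything in between at the last space
    parts1 = data.split(":")
    parts2 = []
    for part in parts1:
        if part in (parts1[0], parts1[-1]):
            parts2.append(part)
        else:
            parts2.extend(part.rsplit(' ', 1))
    result = {}
    for i in range(0, len(parts2), 2):
        result[parts2[i]] = parts2[i + 1]
    return result

def tokenize_optimized(data):
    # The version above: no check inside the loop, pairs built with zip
    parts1 = data.split(":")
    parts2 = [parts1.pop(0)]
    for part in parts1[:-1]:
        parts2.extend(part.rsplit(' ', 1))
    parts2.append(parts1.pop())
    return {k: v for k, v in zip(parts2[::2], parts2[1::2])}

data = ("signal_id:debug_word12_ind data:{ } virtual_interface_index:0x0000 "
        "module_id:0x0001 sequence_number:0x0174")

for func in (tokenize_original, tokenize_optimized):
    elapsed = timeit.timeit(lambda: func(data), number=100000)
    print(func.__name__, round(elapsed, 3), "seconds")
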
  • Thanks for that. Unfortunately it makes no real difference to the processing time. – quite68 Oct 06 '19 at 06:50
  • @quite68 I made a big string by repeating your `data` string 100,000 times and got `0.735s` with your code and `0.568s` with mine. I used the Linux `time ./source.py` command for testing. Thus, roughly a 30% gain. – MiniMax Oct 06 '19 at 10:53
  • My tests are different from yours. I put both pieces of code in separate routines and called each routine 100,000 times. When I did this earlier your code was sometimes faster and sometimes slower, but I was using a VM. I've now repeated the test on a NUC using `time` as you did, and your code is about 15% faster. I would give you an upvote but I don't have enough rep points. – quite68 Oct 07 '19 at 08:03