Best way to identify some string in Python

Question

I'm writing a system to read data coming from truck-tracking devices.

The system will receive information from different types of equipment, so the trace strings it receives will differ depending on the equipment model.

So I need a way to identify these strings in order to handle each one correctly. For example, one of the units sends the following string:

[0,0,13825,355255057406002,0,250814,142421,-2197354498319328,-4743040708824992,800,9200,0,0,0,0,0,12,0,31,0,107]

From another device, the string looks like this:

SA200STT;459055;209;20140806;23:18:28;20702;-22.899244;-047.047640;000.044;000.00;11;1;68548721;12.60;100000;2;0016

So my question is, what is the best way for me to identify each of these strings?

A good start might be to check the first character in the string? — Some programmer dude, Aug 25 '14 at 17:30
What do you mean by "identify"? Do you want to parse it or just distinguish between these two versions? — Falko, Aug 25 '14 at 17:31
Yes, but other devices might send strings that start with the same first character. — Vitor Villar, Aug 25 '14 at 17:31
For now, I want to distinguish between these two versions; once I know which one I have, I can handle each differently. — Vitor Villar, Aug 25 '14 at 17:32
IMO, the devices should send their data through different means to the required tool that will read them. Right now, it's like you have 5 locked doors and receive a key each time without knowing what door it can open. — Jerry, Aug 25 '14 at 17:33
If two or more devices send messages with the same prefix or header, you have to find other distinguishing features of the message. Or just see where the message actually *came* from. — Some programmer dude, Aug 25 '14 at 17:37

Bryan Oakley · Accepted Answer · 2014-08-25T18:22:22.853

The first step is to identify what is unique about each format. In the example you give, the first string is enclosed in square brackets, and the second starts with the sequence "SA200STT". So, a first approximation is to match on that:

import re
def identify(s):
    if re.match(r'^\[.*\]$', s):
        return "type 1"
    elif re.match(r'^SA200STT.*$', s):
        return "type 2"
    else:
        return "unknown"

s1 = r'[0,0,13825,355255057406002,0,250814,142421,-2197354498319328,-4743040708824992,800,9200,0,0,0,0,0,12,0,31,0,107]'
s2 = r'SA200STT;459055;209;20140806;23:18:28;20702;-22.899244;-047.047640;000.044;000.00;11;1;68548721;12.60;100000;2;0016'

print("s1:", identify(s1))
print("s2:", identify(s2))

When I run the above I get:

s1: type 1
s2: type 2

I doubt that's the actual algorithm that you need, but that's the general idea. Figure out how you can tell each format apart, then make an expression that detects that.

A note about using regular expressions:

Regular expressions can be slow and can make code harder to understand, so it is generally worth avoiding them when a simpler test will do. If performance or readability is a concern, consider alternatives such as comparing the first N or the last N characters.
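
A minimal sketch of that alternative, assuming the same two formats as in the example above: `str.startswith` and `str.endswith` take the place of the regular expressions. The name `identify_no_regex` is made up for illustration.

```python
def identify_no_regex(s):
    """Classify a message by its first and last characters, without regex."""
    # Assumption: type 1 messages are always wrapped in square brackets.
    if s.startswith("[") and s.endswith("]"):
        return "type 1"
    # Assumption: type 2 messages always begin with the tag "SA200STT".
    if s.startswith("SA200STT"):
        return "type 2"
    return "unknown"


s1 = "[0,0,13825,355255057406002,0,250814,142421,-2197354498319328,-4743040708824992,800,9200,0,0,0,0,0,12,0,31,0,107]"
s2 = "SA200STT;459055;209;20140806;23:18:28;20702;-22.899244;-047.047640;000.044;000.00;11;1;68548721;12.60;100000;2;0016"

print("s1:", identify_no_regex(s1))
print("s2:", identify_no_regex(s2))
```

If messages can arrive with a trailing newline, calling `s.strip()` first keeps the `endswith` check from failing on it.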

+1 but you might want to avoid using regex if you have a lot of data. — dss539, Aug 25 '14 at 18:17
@dss539: good point. I added a note about avoiding regular expressions if performance or readability is a concern. — Bryan Oakley, Aug 25 '14 at 18:23
Thanks @BryanOakley, this is the kind of thing I need. Just one question about the code: what is the 'r' before the string? — Vitor Villar, Aug 25 '14 at 18:26
@VitorLuis: the 'r' says that it is a raw string literal. See http://stackoverflow.com/q/2081640/7432 — Bryan Oakley, Aug 25 '14 at 18:43

score 1 · Answer 2 · answered Aug 25 '14 at 18:14

It sounds pretty simple. Just check some distinguishing characteristic of the data to recognize the format. Depending on how complex each of your formats is, you can probably do this without using a regex.

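def parse(data):
    # Pick the parser that matches this message, then apply it.
    parse_format = get_parser(data)
    return parse_format(data)

def get_parser(data):
    if is_format_a(data):
        return parse_format_a
    if is_format_b(data):
        return parse_format_b
    # etc. (one check per additional format)
    raise ValueError("unrecognized format")

def is_format_a(data):
    # startswith() is safe on an empty string, unlike data[0]
    return data.startswith('[')

def is_format_b(data):
    # Assumed check: the second sample string begins with "SA200STT"
    return data.startswith('SA200STT')

def parse_format_a(data):
    return data.strip('[]').split(',')

def parse_format_b(data):
    return data.split(';')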
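
As more device models are added, the chain of `if` statements can be replaced with a table of detector and parser pairs. The following is only a sketch, assuming each format can be recognized by a simple predicate; the names `PARSERS` and `parse_message` are hypothetical, and the "SA200STT" prefix check for format B is an assumption based on the sample string in the question.

```python
def is_format_a(data):
    return data.startswith("[")

def is_format_b(data):
    # Assumption: format B messages begin with the tag "SA200STT".
    return data.startswith("SA200STT")

def parse_format_a(data):
    return data.strip("[]").split(",")

def parse_format_b(data):
    return data.split(";")

# Ordered table of (detector, parser) pairs; the first match wins.
PARSERS = [
    (is_format_a, parse_format_a),
    (is_format_b, parse_format_b),
]

def parse_message(data):
    """Return the list of fields, or raise ValueError if no detector matches."""
    for detect, parse in PARSERS:
        if detect(data):
            return parse(data)
    raise ValueError(f"unrecognized message format: {data[:20]!r}")
```

Supporting a new device then means appending one pair to `PARSERS` rather than editing the dispatch logic.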

score 1 · Answer 3 · answered Aug 25 '14 at 19:08

Bryan Oakley gives a good solution. To use his own words: the first step is to identify what is unique about each format.

You just have to check which of the characters ; or , is present. In fact, checking for one of them is enough, since the two are mutually exclusive in these formats.

For instance:

s1 = "[0,0,13825,355255057406002,0,250814,142421,-2197354498319328,-4743040708824992,800,9200,0,0,0,0,0,12,0,31,0,107]"
s2 = r'SA200STT;459055;209;20140806;23:18:28;20702;-22.899244;-047.047640;000.044;000.00;11;1;68548721;12.60;100000;2;0016'

if ',' in s1:
    print("Type 1")
else:
    print("Type 2")

This seems to be the fastest way, since regular expressions are slow, and from your question I gather you will be reading from a device, so you need speed.
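
The check above assumes there are exactly two formats. A slightly more defensive sketch, under the same assumption that commas appear only in the first format and semicolons only in the second, reports anything else as unknown. The function name `identify_by_delimiter` is hypothetical.

```python
def identify_by_delimiter(s):
    """Classify a message by which field delimiter it contains."""
    has_comma = "," in s
    has_semicolon = ";" in s
    # Assumption: type 1 uses only commas, type 2 uses only semicolons.
    if has_comma and not has_semicolon:
        return "Type 1"
    if has_semicolon and not has_comma:
        return "Type 2"
    # Neither delimiter, or both: not one of the two known formats.
    return "unknown"


print(identify_by_delimiter("[0,0,13825,355255057406002]"))
print(identify_by_delimiter("SA200STT;459055;209;20140806"))
```

If a third device also separated its fields with commas or semicolons, this test would no longer be enough and a prefix or field-count check would be needed as well.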
