Parsing/searching a comma and semicolon-separated string in python

Question

The thing I am working on at the moment has fairly long strings of data like this:

56,1,0,153,0,0;56,1,0,153,0,0;56,1,0,153,0,0;5,1,2,34,B_3_1_1,0;5,1,2,34,C_9841,0;

I would like to look for the values starting with 'C_' and return the number after it. I know they will always be on the fourth position of a list of values delimited by semicolon.

I was thinking about using a regular expression to parse the string into a list and then searching that list, but I don't think that would be very efficient.

Can anybody point me in the right direction in tackling this problem?

The number after `C_` with a comma or not? What do you mean by 'always be on the fourth position'? And yes, regex is most probably the most efficient way to search the string if it involves complex rules - you just don't need to pull/split everything into a list. — zwer, Feb 21 '17 at 00:02
There are comma-separated strings of 5 values, e.g. 56,1,0,153,0,0, and these comma-separated strings are separated by semicolons. — Stanislav Pavlovič, Feb 21 '17 at 00:04
Can `C_` appear anywhere else in the string? As for the value itself, assuming the above string, you want `9841` as a result, right? — zwer, Feb 21 '17 at 00:06
That's right, I want '9841' (a string) as a result. There will be multiple instances of C_ in such string, I need to find them all and store them in a list. — Stanislav Pavlovič, Feb 21 '17 at 00:09

zwer · Accepted Answer · 2017-02-21T00:24:51.597

You can use a simple re.findall() for this:

import re

your_string = "56,1,0,153,0,0;56,1,0,153,0,0;56,1,0,153,0,0;5,1,2,34,B_3_1_1,0;5,1,2,34,C_9841,0;"

c_values = re.findall(r"C_(\d+)", your_string)  # ['9841']

EDIT: If you need your values as numbers, you can wrap the call in a list comprehension:

c_values = [int(x) for x in re.findall(r"C_(\d+)", your_string)]  # [9841]

EDIT #2: Since you seem to be worried about performance, in almost all cases regex will be the fastest way to do it. If you're planning to run this on a large number of strings (not a few large strings) every little bit might help so compile your regex first and then call it when needed:

your_regex = re.compile(r"C_(\d+)")

# now use your_regex whenever you need it
c_values = your_regex.findall(your_string)  # ['9841']
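If `C_` could also turn up inside a longer token (the comments on the question ask whether it can appear elsewhere), a stricter pattern is one option. The sketch below is only an assumption about what "a whole field" means here: it accepts `C_<digits>` only when it fills an entire comma- or semicolon-delimited field. The names `FIELD_C` and `find_c_values` and the sample string are made up for illustration.

```python
import re

# Hypothetical stricter pattern: the match must start at the beginning of the
# string or right after a delimiter, and must be followed by a delimiter or
# the end of the string.
FIELD_C = re.compile(r"(?:^|[,;])C_(\d+)(?=[,;]|$)")


def find_c_values(data):
    """Return the digits of every field that is exactly C_<digits>."""
    return FIELD_C.findall(data)


# Made-up sample: XC_12 is not a whole C_ field, so it is skipped.
sample = "5,1,2,34,C_9841,0;5,1,2,34,XC_12,0;7,1,2,34,C_77,0;"
print(find_c_values(sample))  # ['9841', '77']
```

Whether this extra strictness is needed depends on the data; for the string in the question the plain `C_(\d+)` pattern gives the same result.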

Elmex80s · Answer 2 · 2017-02-21T00:19:54.083

This works by splitting on both delimiters and taking the first `C_` field:

import re

long_str = "56,1,0,153,0,0;56,1,0,153,0,0;56,1,0,153,0,0;5,1,2,34,B_3_1_1,0;5,1,2,34,C_9841,0;"

# split on both delimiters at once
splitted_str = re.split(';|,', long_str)

# first field that starts with "C_", with the prefix stripped
print(next(int(x[2:]) for x in splitted_str if x[:2] == "C_"))

An alternative

long_str = "56,1,0,153,0,0;56,1,0,153,0,0;56,1,0,153,0,0;5,1,2,34,B_3_1_1,0;5,1,2,34,C_9841,0;"

split1 = long_str.split(';')

# first record that contains a "C_" field
split2 = next(y for y in split1 if "C_" in y)

print(next(int(x[2:]) for x in split2.split(',') if x[:2] == "C_"))
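Both snippets above stop at the first hit because of `next()`, while the asker's follow-up comment asks for every `C_` value as a list of strings. A minimal sketch of the same split-based idea that collects all of them could look like this; the function name `all_c_values` and the second sample string are hypothetical.

```python
def all_c_values(data):
    """Collect the text after 'C_' for every field that starts with it."""
    results = []
    for record in data.split(";"):      # records are separated by semicolons
        for field in record.split(","):  # fields are separated by commas
            if field.startswith("C_"):
                results.append(field[2:])  # keep as a string, drop the prefix
    return results


long_str = "56,1,0,153,0,0;56,1,0,153,0,0;56,1,0,153,0,0;5,1,2,34,B_3_1_1,0;5,1,2,34,C_9841,0;"
print(all_c_values(long_str))  # ['9841']

# Made-up string with two C_ fields
print(all_c_values("5,1,2,34,C_9841,0;5,1,2,34,C_12,0;"))  # ['9841', '12']
```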

edited Feb 21 '17 at 00:19

answered Feb 21 '17 at 00:00

That's **crazy** inefficient. – zwer Feb 21 '17 at 00:15

Paul Panzer · Answer 3 · 2017-02-21T00:26:55.770

A simple solution is to use the .find method.

instr = "56,1,0,153,0,0;56,1,0,153,0,0;56,1,0,153,0,0;5,1,2,34,B_3_1_1,0;5,1,2,34,C_9841,0;"

results = []
index = instr.find('C_')
while index >= 0:
    length = instr[index:].find(',')
    assert length > 0
    results.append(instr[index+2:index+length])
    instr = instr[index+length:]
    index = instr.find('C_')

Another simple and probably more efficient method would be to .split on "C_":

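bits = instr.split('C_')[1:]  # each bit starts right after a 'C_'
stops = [bit.find(',') for bit in bits]
# the prefix is already gone after the split, so slice from the start of the bit
results = [bit[:stop] for bit, stop in zip(bits, stops) if stop > 0]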
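The `.find` loop above re-slices `instr` on every hit and asserts that a comma follows each value. A variation that leaves the input untouched is to pass a start offset to `str.find` and stop at the next comma, semicolon, or the end of the string. This is a sketch under the assumption that values never contain those delimiters; `scan_c_values` is a made-up name.

```python
def scan_c_values(data, prefix="C_"):
    """Scan for prefix occurrences without copying or modifying the input."""
    results = []
    index = data.find(prefix)
    while index >= 0:
        start = index + len(prefix)
        end = start
        # advance to the next delimiter, or to the end of the string
        while end < len(data) and data[end] not in ",;":
            end += 1
        results.append(data[start:end])
        # resume the search after the value just collected
        index = data.find(prefix, end)
    return results


instr = "56,1,0,153,0,0;56,1,0,153,0,0;56,1,0,153,0,0;5,1,2,34,B_3_1_1,0;5,1,2,34,C_9841,0;"
print(scan_c_values(instr))  # ['9841']
```

Like the original loop, this matches `C_` anywhere, including inside a longer token such as `ABC_1`, so it relies on the prefix only appearing at the start of a field.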

edited Feb 21 '17 at 00:26

answered Feb 21 '17 at 00:11

You need to be sure there is a `','` after `'C_9841'`. – Elmex80s Feb 21 '17 at 00:13

score 0 · Answer 4 · edited May 23 '17 at 12:16

Suppose:

s = '56,1,0,153,0,0;56,1,0,153,0,0;56,1,0,153,0,0;5,1,2,34,B_3_1_1,0;5,1,2,34,C_9841,0;'

For a one-liner that avoids regex, this should work:

Python 2/3 – credit

next(i for sublist in (ss.split(',') for ss in s.split(';')) for i in sublist if i.startswith('C_'))[2:]

Python 3

import itertools # err... it becomes 3 lines

next(i for i in itertools.chain.from_iterable(
    ss.split(',') for ss in s.split(';')) if i.startswith('C_'))[2:]
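One caveat with both one-liners: `next()` raises `StopIteration` when no field starts with `C_`. If that case can occur, passing a default to `next()` avoids the exception. A small sketch, with `first_c_value` as a hypothetical helper name:

```python
def first_c_value(data, default=None):
    """Return the text after 'C_' in the first matching field, else default."""
    fields = (f for record in data.split(";") for f in record.split(","))
    # the second argument to next() is returned when the generator is exhausted
    match = next((f for f in fields if f.startswith("C_")), None)
    return match[2:] if match is not None else default


s = '56,1,0,153,0,0;56,1,0,153,0,0;56,1,0,153,0,0;5,1,2,34,B_3_1_1,0;5,1,2,34,C_9841,0;'
print(first_c_value(s))          # 9841
print(first_c_value("1,2;3,4"))  # None
```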

However, if things get complicated, I myself prefer regex. The modern rule states "don't do premature optimization" and "make your code readable".
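If performance really does matter for your data, measuring is more reliable than guessing. The sketch below uses the standard `timeit` module to time a compiled regex against a split-based list comprehension on the string from the question. The function names are made up, and the numbers will vary by machine and Python version, so no expected timings are shown.

```python
import re
import timeit

DATA = "56,1,0,153,0,0;" * 3 + "5,1,2,34,B_3_1_1,0;5,1,2,34,C_9841,0;"
C_PATTERN = re.compile(r"C_(\d+)")


def with_regex(data=DATA):
    return C_PATTERN.findall(data)


def with_split(data=DATA):
    return [
        field[2:]
        for record in data.split(";")
        for field in record.split(",")
        if field.startswith("C_")
    ]


def compare(number=10000):
    """Time both approaches; returns seconds taken for `number` calls each."""
    return {
        "regex": timeit.timeit(with_regex, number=number),
        "split": timeit.timeit(with_split, number=number),
    }


print(compare(number=1000))
```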
