Regex Expression to capture repeated patterns

Question

I've been searching the internet for how to build a regular expression that captures text the way I need. I found some Stack Overflow questions, but none of them cover what I want. If you have already seen something similar to my issue, please feel free to point me to it.

I tried to use recursion, but I couldn't get anything to work.

Some notes:

1) I can't use a parser program, because the program that will consume this data uses regular expressions to capture it. It is a "general purpose" program that captures whatever data is needed; the only thing I need to do is supply a proper regular expression for the information it needs. I also need to keep it as compact as possible, so I can't use third-party or external programs.

2) The number of 'key': 'value' pairs can vary; there are not always the same number of pairs. I believe that is what makes it difficult.

3) The program that is going to use this regex is written in Python 2.7.3. How it works: it uses a JSON config file where I set up the command that produces the data I need, then I specify a regex that tells the program what needs to be captured and how to handle it, i.e. what to do with the groups that get captured. That is why I can't use a parser. The program uses Fabric to run the configured collector (with the regex) on remote hosts and gather all the data.

4) The program is used to gather data and post it to a web server to get metrics and other things like graphs, monitoring alarms, etc.

I have been able to capture almost all the data I was planning to capture, but when I tried to create a collector for this output I got stuck.

The following data repeats exactly like below but with different server names, of course values will change too:

Server: Omega-X
celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0', 'factor_b': '0', 'score_retry': '0', 'damage_factor_c': '0', 'voice_ud': '0', 'alarm_factors_bl': '0', 'telemetry_x': '0', 'endstream': '0', 'celery': '0', 'awl': '0', 'prs': '0', 'score': '0', 'feature_factors_xf': '0', 'feature_factors_dc': '0'}


Server: Alfa-X
celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0', 'factor_b': '0', 'score_retry': '0', 'damage_factor_c': '0', 'voice_ud': '0', 'alarm_factors_bl': '0', 'telemetry_x': '0', 'endstream': '0', 'celery': '0', 'awl': '0', 'prs': '0', 'score': '0', 'feature_factors_xf': '0', 'feature_factors_dc': '0'}

How I want to capture it:

Server: Omega-X

 transfer_data: 0
 factor_a: 0
 slow: 0
 factor_b: 0
 score_retry: 0
 damage_factor_c: 0
 voice_ud: 0
 alarm_factors_bl: 0
 telemetry_x: 0
 endstream: 0
 celery: 0
 awl: 0
 prs: 0
 score: 0
 feature_factors_xf: 0
 feature_factors_dc: 0

Server: Alfa-X

 transfer_data: 0
 factor_a: 0
 slow: 0
 factor_b: 0
 score_retry: 0
 damage_factor_c: 0
 voice_ud: 0
 alarm_factors_bl: 0
 telemetry_x: 0
 endstream: 0
 celery: 0
 awl: 0
 prs: 0
 score: 0
 feature_factors_xf: 0
 feature_factors_dc: 0

If only a single server is shown, it is not so difficult; using the regex below I'm able to capture everything (except the name of the server):

'([a-z_]+)':\s'(\d+)'

This regex gives only the second part, which is the list of variables and values, but not the server name. So if the same output contains several servers with the same data, it is impossible to know which server the values come from.
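A minimal sketch of that limitation, assuming Python's built-in re module. The sample is a shortened, hypothetical version of the output above with made-up values; the pattern is the one shown above.

```python
import re

# Shortened, hypothetical sample in the same shape as the real output.
sample = (
    "Server: Omega-X\n"
    "celery.queue_length: {'transfer_data': '0', 'slow': '3'}\n"
    "\n"
    "Server: Alfa-X\n"
    "celery.queue_length: {'transfer_data': '7', 'slow': '0'}\n"
)

# The pair-only pattern from the question.
pair_pattern = re.compile(r"'([a-z_]+)':\s'(\d+)'")

# findall returns one (key, value) tuple per pair, across all servers,
# with nothing that says which server a pair came from.
pairs = pair_pattern.findall(sample)
print(pairs)
```

The result is one flat list of pairs, so the two 'transfer_data' entries cannot be told apart.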

If I try to add support for the server name, I've tried the following regex. It works, but it only captures the server name and the first pair of parameters:

Server:\s([a-zA-Z0-9-]+)\s*celery\.queue_length:\s.('([a-z_]+)':\s'(\d+)')*

I have tried multiple recursion features, but I've failed to achieve what I want.
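A small sketch of why a repeated group does not help here, assuming Python's re module and a shortened, hypothetical sample. The first pattern is the one above; the second is an illustrative variant (not from the original attempt) that also consumes the ", " separator. In both cases a repeated capture group keeps only a single iteration.

```python
import re

# Shortened, hypothetical sample for one server.
sample = "Server: Omega-X\ncelery.queue_length: {'transfer_data': '0', 'slow': '3'}"

# Pattern from the question: the repetition stops at the first ", ",
# so only the first pair is matched.
original = re.compile(
    r"Server:\s([a-zA-Z0-9-]+)\s*celery\.queue_length:\s.('([a-z_]+)':\s'(\d+)')*"
)
first_only = original.search(sample).groups()
print(first_only)

# Illustrative variant that also consumes the separator: every pair is
# matched, but each capture group only remembers its LAST iteration.
with_separator = re.compile(
    r"Server:\s([a-zA-Z0-9-]+)\s*celery\.queue_length:\s\{"
    r"(?:'([a-z_]+)':\s'(\d+)',?\s*)*"
)
last_only = with_separator.search(sample).groups()
print(last_only)
```

Either way the number of groups is fixed by the pattern, not by the number of pairs in the data.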

Can anyone point me in the right direction here?

Thanks.

Your first pattern worked because it matched each key/value pair one at a time. It sounds like what you're trying to do now is take care of the entire set in one go, with just a regex pattern. Getting your program to capture an arbitrary number of groups like this is probably not possible (but it wouldn't be so bad if the number were the same, as you mentioned above). Instead of focusing on capturing the key/value data, can you get by with a formatting operation, such as matching `celery\.queue_length: \{|,` and replacing with `\n`? See: https://regex101.com/r/rzmJgj/1 — CAustin, Sep 13 '17 at 00:47
Hi CAustin, thanks for your response. It may be useful, but it still needs some cleaning, as you mention. After sleeping on it, I started to think, as you and others may point out, that this is almost impossible to achieve with only a regex, because I need to capture groups too... — Larry, Sep 13 '17 at 08:03
I have to admit I will need to go against my own rules here, as what I want is really difficult for the program to achieve using regex. What I mean is: my program needs groups to be able to gather data, and in this case the regex would give me n groups that I may not be able to handle properly... so a pre-parse program will probably be required... — Larry, Sep 13 '17 at 11:00
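The replace-based idea from the first comment above can be sketched in Python as follows. This assumes re.sub and a shortened, hypothetical sample; the pattern is the one given in that comment.

```python
import re

# Shortened, hypothetical sample for one server.
sample = "Server: Omega-X\ncelery.queue_length: {'transfer_data': '0', 'slow': '3'}"

# Replace the "celery.queue_length: {" prefix and every comma with a newline,
# so each pair ends up on its own line.
reformatted = re.sub(r"celery\.queue_length: \{|,", "\n", sample)
print(reformatted)
```

As noted in the comments, the result still needs cleaning: the quotes, the leading spaces, and the closing brace are left in place.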

score 1 · Accepted Answer · answered Sep 13 '17 at 08:16

You want key-value pairs? With Python I would use a dictionary.

First, get the server name and the string containing the data:
Server: ([^\n]*)(?:[^{]*)\{(.*)\}
Then build a dict from the string containing the data for each server.

With Python (you only need the import re statement):

import re

input = """Server: Omega-X
celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0', 'factor_b': '0', 'score_retry': '0', 'damage_factor_c': '0', 'voice_ud': '0', 'alarm_factors_bl': '0', 'telemetry_x': '0', 'endstream': '0', 'celery': '0', 'awl': '0', 'prs': '0', 'score': '0', 'feature_factors_xf': '0', 'feature_factors_dc': '0'}

Server: Alfa-X
celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0', 'factor_b': '0', 'score_retry': '0', 'damage_factor_c': '0', 'voice_ud': '0', 'alarm_factors_bl': '0', 'telemetry_x': '0', 'endstream': '0', 'celery': '0', 'awl': '0', 'prs': '0', 'score': '0', 'feature_factors_xf': '0', 'feature_factors_dc': '0'}"""


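for match in re.findall(r'Server: ([^\n]*)(?:[^{]*)\{(.*)\}', input):
    server = match[0]  # server name, e.g. "Omega-X"
    data = match[1]    # everything between the braces
    # split on commas, then on colons, and strip spaces and quotes from each part
    datadict = dict((k.strip().replace("'", ""), v.strip().replace("'", "")) for k,v in (item.split(':') for item in data.split(',')))
    datadict['serveur'] = server  # keep the server name alongside its values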

Then you can store each datadict (e.g. in a list) and use them as you want. You can cast the values from strings to integers to manipulate them easily.
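A minimal sketch of that last step, collecting one dict per server in a list and casting the values to integers. The function name parse_servers, the "server" key, and the shortened sample values are assumptions made for illustration; the regex is the one from this answer.

```python
import re

# Shortened, hypothetical sample in the same shape as the real output.
sample = """Server: Omega-X
celery.queue_length: {'transfer_data': '0', 'slow': '3'}

Server: Alfa-X
celery.queue_length: {'transfer_data': '7', 'slow': '0'}"""


def parse_servers(text):
    """Return one dict per server, with queue lengths cast to int."""
    results = []
    for server, data in re.findall(r"Server: ([^\n]*)(?:[^{]*)\{(.*)\}", text):
        entry = {"server": server}
        for item in data.split(","):
            key, value = item.split(":")
            # Strip surrounding spaces and single quotes, then cast the value.
            entry[key.strip().strip("'")] = int(value.strip().strip("'"))
        results.append(entry)
    return results


servers = parse_servers(sample)
print(servers)
```

This relies on keys and values containing no commas or colons, which holds for the data shown in the question.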

Sorry mquantin, I didn't see your answer before posting my other idea here. I will check your advice too, thank you! — Larry, Sep 13 '17 at 12:18
Hi mquantin, you know what, I like this one, because it captures everything without altering the original format, which I think is better... I was avoiding going into the Python code of the collector just for this particular output, but I think there's no other easy way to get it... Thanks a lot, I think I will use this approach. — Larry, Sep 13 '17 at 12:35

Hooman Bahreini · Answer 2 · 2017-09-13T08:45:17.920

You can use ANTLR to define your grammar, which would be a better option than regular expressions: https://dzone.com/articles/antlr-4-with-python-2-detailed-example

If you want to use regular expressions, you can use the following. Please note that my code is in C#; the patterns rely on .NET's variable-length lookbehind, so they may need adapting for Python's re module.

string serverNamePattern = @"(?<=Server(\s)*:(\s))\s*[\w-]+";
string dataPattern = @"(?<=celery.queue_length[\s:]*{)[a-zA-Z0-9\s:\'_,]+";
string input = 
    // "\n" keeps each line separate, as in the real output
    "Server: Omega-X\n" + 
    "celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0', 'factor_b': '0', 'score_retry': '0', 'damage_factor_c': '0', 'voice_ud': '0', 'alarm_factors_bl': '0', 'telemetry_x': '0', 'endstream': '0', 'celery': '0', 'awl': '0', 'prs': '0', 'score': '0', 'feature_factors_xf': '0', 'feature_factors_dc': '0'}\n" + 
    "Server: Alfa-X\n" + 
    "celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0', 'factor_b': '0', 'score_retry': '0', 'damage_factor_c': '0', 'voice_ud': '0', 'alarm_factors_bl': '0', 'telemetry_x': '0', 'endstream': '0', 'celery': '0', 'awl': '0', 'prs': '0', 'score': '0', 'feature_factors_xf': '0', 'feature_factors_dc': '0'}";

var serverNames = Regex.Matches(input, serverNamePattern);
var dataMatches = Regex.Matches(input, dataPattern);

Explanation:

+: one or more occurrence

\w: word character (alphanumeric or underscore)

\s: white space

[]: character class (a set or range of characters)

(?<=a)b: positive lookbehind, match b that comes after a

(?<=Server(\s)*:(\s))\s*[\w-]+: match word characters and hyphens (with optional leading white space) that come after Server:

(?<=celery.queue_length[\s:]*{)[a-zA-Z0-9\s:\'_,]+: match a run of characters from [a-zA-Z0-9\s:'_,] that comes after celery.queue_length: {

Note that you need to add "Server: " before the server name. Also, this does not remove the single quotes from the data.
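For Python, a rough equivalent can use capture groups instead of lookbehind, since Python's built-in re module only accepts fixed-width lookbehind patterns. This is a sketch under that assumption, with a shortened, hypothetical sample; the two patterns are adaptations, not the C# patterns themselves.

```python
import re

# Shortened, hypothetical sample in the same shape as the real output.
sample = (
    "Server: Omega-X\n"
    "celery.queue_length: {'transfer_data': '0', 'slow': '3'}\n"
    "Server: Alfa-X\n"
    "celery.queue_length: {'transfer_data': '7', 'slow': '0'}\n"
)

# Capture groups take the place of the lookbehind assertions.
server_pattern = re.compile(r"Server\s*:\s*([\w-]+)")
data_pattern = re.compile(r"celery\.queue_length[\s:]*\{([^}]*)\}")

server_names = server_pattern.findall(sample)
data_blocks = data_pattern.findall(sample)

# Pair each server name with its data block by position.
for name, block in zip(server_names, data_blocks):
    print("%s -> %s" % (name, block))
```

Pairing by position assumes every Server line is followed by exactly one celery.queue_length line, as in the output shown in the question.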

Hi Hooman, I understand your approach, but I'm not sure I can apply it to the program I have. It may change the behavior for other metrics that are working, but nevertheless, I can try... I probably need a new point of view... Thanks! — Larry, Sep 13 '17 at 08:06
Sorry, I had not posted the correct code; updated. Please note that Matches returns a list of matches that you can iterate, hence there is no problem with regex and matching groups... but I think ANTLR is a better option for a complex grammar. — Hooman Bahreini, Sep 13 '17 at 08:46
Hi Hooman, on second thought a pre-parser program (yes, I know I'm going against myself here) will be required to prepare the output as I wish... — Larry, Sep 13 '17 at 11:04

score 0 · Answer 3 · answered Sep 13 '17 at 12:16

Thanks to those who kindly responded to my question; I think both of you helped me reshape the way I'm seeing this issue.

My belief is that what I want to achieve here is very difficult for a regex alone.

Given how difficult it is to get the information I want, I was thinking about which way would be easiest for me to get it. I know I'm going against my own rules here, but I think there's no other way to do this smoothly.

If I want to get regex groups like:

Server: Group 0
Key : Group 1
Value: Group 2

then the output I need should look like:

Regex Groups:
        (0)      (1)          (2)         
Server: Omega-X transfer_data: 0
Server: Omega-X factor_a: 0
Server: Omega-X slow: 0
Server: Omega-X factor_b: 0
Server: Omega-X score_retry: 0
Server: Omega-X damage_factor_c: 0
Server: Omega-X voice_ud: 0
Server: Omega-X alarm_factors_bl: 0
Server: Omega-X telemetry_x: 0
Server: Omega-X endstream: 0
Server: Omega-X celery: 0
Server: Omega-X awl: 0
Server: Omega-X prs: 0
Server: Omega-X score: 0
Server: Omega-X feature_factors_xf: 0
Server: Omega-X feature_factors_dc: 0

This way I can process any number of servers in the same output without any difficulty, using a very simple regex:

"Server:\s([a-zA-Z_.-]+)\s'([a-zA-Z_]+)':\s'(\d+)'"

So I think the best way to go is to add a pre-parser that prepares the data like this, and then process it.
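A rough sketch of such a pre-parser, assuming Python's re module. The function name flatten, the shortened sample, and the line regex are illustrative assumptions; the line regex is adapted to the unquoted "Server: name key: value" layout shown above.

```python
import re

# Shortened, hypothetical sample in the same shape as the real output.
sample = (
    "Server: Omega-X\n"
    "celery.queue_length: {'transfer_data': '0', 'slow': '3'}\n"
    "\n"
    "Server: Alfa-X\n"
    "celery.queue_length: {'transfer_data': '7', 'slow': '0'}\n"
)


def flatten(text):
    """Rewrite each server block as one 'Server: name key: value' line per pair."""
    lines = []
    blocks = re.findall(
        r"Server:\s*([\w.-]+)\s+celery\.queue_length:\s*\{([^}]*)\}", text
    )
    for server, data in blocks:
        for key, value in re.findall(r"'([a-z_]+)':\s'(\d+)'", data):
            lines.append("Server: %s %s: %s" % (server, key, value))
    return "\n".join(lines)


# A fixed three-group regex is now enough for every line.
line_pattern = re.compile(r"Server:\s([\w.-]+)\s([a-zA-Z_]+):\s(\d+)")

flat = flatten(sample)
rows = line_pattern.findall(flat)
print(flat)
```

Each flattened line always yields exactly three groups (server, key, value), regardless of how many pairs a server reports.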

In fact, both of you helped me with this; much appreciated.

I guess I will close this question unless somebody else has a better idea :)
