fetch specific protobuf members

Question

I want to get an array of the values on all the lines that start with `text:` (each one up to the following `asset_performance_label`).

I saw this post, but wasn't sure how to apply it.

Should I convert the proto to string, as I have tried?

    text = extract_text_from_proto(r"(\w+)text:(\w+)asset_performance_label:", '''[pinned_field: HEADLINE_1
    text: "5 Best Products"
    asset_performance_label: PENDING
    policy_summary_info
    {
        review_status: REVIEWED
        approval_status: APPROVED
    }
    , pinned_field: HEADLINE_1
    text: "10 Best Products 2021"
    asset_performance_label: PENDING
    policy_summary_info
    {
        review_status: REVIEWED
        approval_status: APPROVED
    }''')


def extract_text_from_proto(regex, proto_string):
    regex = re.escape(regex)
    result_array = [m.group() for m in re.finditer(regex, proto_string)]
    return result_array
    # return [extract_text(each_item, regex) for each_item in proto],


def extract_text(regex, item):
    m = re.match(regex, str(item))
    if m is None:
        # text = "MISSING TEXT"
        raise Exception("Ad is missing text")
    else:
        text = m.group(2)
    return text

Expected result: ["5 Best Products","10 Best Products 2021"]

What if I also want to match an optional `pinned_field: (word)`, so that the result could be `['HEADLINE_1: 5 Best Products', 'HEADLINE_1: 10 Best Products 2021', 'some_text_without_pinned_field']`?

So you want to match `5 Best Products` and `10 Best Products 2021`? – The fourth bird, Sep 07 '21 at 20:39
Yes, please. Expected result: `["5 Best Products","10 Best Products 2021"]` – Elad Benda, Sep 07 '21 at 20:58

The fourth bird · Accepted Answer · 2021-09-07T21:07:19.987

You can use a single capture group and match `asset_performance_label` on the next line. Use `re.findall` to return the group values.

\btext:\s*"([^"]+)"\n\s*asset_performance_label\b

The pattern matches

`\btext:\s*"` Match `text:` preceded by a word boundary `\b` to prevent a partial match, then optional whitespace chars and the opening double quote
`([^"]+)` Capture group 1, match 1+ chars other than a double quote
`"\n\s*` Match the closing double quote, a newline and optional whitespace chars
`asset_performance_label\b` Match `asset_performance_label` followed by a word boundary

For example

import re

def extract_text_from_proto(regex, proto_string):
    return re.findall(regex, proto_string)

text = extract_text_from_proto(r'\btext:\s*"([^"]+)"\n\s*asset_performance_label\b', '''[pinned_field: HEADLINE_1
    text: "5 Best Products"
    asset_performance_label: PENDING
    policy_summary_info
    {
        review_status: REVIEWED
        approval_status: APPROVED
    }
    , pinned_field: HEADLINE_1
    text: "10 Best Products 2021"
    asset_performance_label: PENDING
    policy_summary_info
    {
        review_status: REVIEWED
        approval_status: APPROVED
    }''')


print(text)

Output

['5 Best Products', '10 Best Products 2021']
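
The breakdown of the pattern can also be kept next to the pattern itself. The following is a minimal sketch, assuming the same pattern compiled with `re.VERBOSE`; the names `TEXT_PATTERN` and `sample` are made up for illustration, and the `context:` line is only there to show what the leading `\b` guards against.

```python
import re

# Same pattern as in the answer, spread over several lines with re.VERBOSE.
# In verbose mode, whitespace in the pattern is ignored and "#" starts a comment.
TEXT_PATTERN = re.compile(r'''
    \btext:\s*"                  # "text:" at a word boundary, optional whitespace, opening quote
    ([^"]+)                      # group 1: one or more chars other than a double quote
    "\n\s*                       # closing quote, a newline, optional whitespace
    asset_performance_label\b    # the label on the next line, ending at a word boundary
''', re.VERBOSE)

# Hypothetical input: "context:" ends in "text:", but there is no word
# boundary between "n" and "t", so \b rejects that partial match.
sample = (
    'context: "skip me"\n'
    '    asset_performance_label: PENDING\n'
    'text: "keep me"\n'
    '    asset_performance_label: PENDING\n'
)

print(TEXT_PATTERN.findall(sample))  # ['keep me']
```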

edited Sep 07 '21 at 21:07

answered Sep 07 '21 at 21:01

The fourth bird

Why don't we need to use `regex = re.escape(regex)`? – Elad Benda Sep 07 '21 at 21:03
What does `\b` mean? – Elad Benda Sep 07 '21 at 21:04
@EladBenda I will add a breakdown of the pattern. You can omit [re.escape](https://docs.python.org/3/library/re.html#re.escape) as we create the pattern and we already know what to escape. – The fourth bird Sep 07 '21 at 21:05
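
To illustrate the point about `re.escape`: it turns every regex metacharacter into a literal, so an escaped pattern only matches the pattern's own source text. A small sketch, using a shortened version of the pattern and illustrative variable names:

```python
import re

pattern = r'\btext:\s*"([^"]+)"'
escaped = re.escape(pattern)  # backslashes, parentheses, brackets etc. become literals
line = 'text: "5 Best Products"'

# The unescaped pattern is interpreted as a regex and captures the quoted value.
print(re.findall(pattern, line))   # ['5 Best Products']

# The escaped pattern looks for the literal characters \btext:\s*"([^"]+)"
# in the input, which are not there.
print(re.findall(escaped, line))   # []

# It does match the pattern's own text, which is what re.escape is meant for.
print(re.findall(escaped, pattern) == [pattern])  # True
```

`re.escape` is useful when part of a pattern comes from untrusted or arbitrary text that should be matched literally, not when the whole pattern is written by hand.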
Thank you so much. What if I want to match an (optional) `pinned_field: (word)`, so the result could be `['HEADLINE_1: 5 Best Products', 'HEADLINE_1: 10 Best Products 2021', 'some_text_without_pinned_field']`? – Elad Benda Sep 07 '21 at 21:43
@EladBenda You could use 2 capture groups, which will both be returned by re.findall: `pinned_field: (.+)\n\s*text: "([^"]+)"` https://regex101.com/r/yBiByl/1 – The fourth bird Sep 07 '21 at 21:49
Thanks, but how will it handle cases when there is no `pinned_field`? Plus how will it concatenate the two groups when the first is optional? – Elad Benda Sep 07 '21 at 22:24
@EladBenda Then you could make the first group optional https://regex101.com/r/CFZXHp/1 and re.findall will return a list of tuples where the first item is an empty string if group 1 did not match. See https://ideone.com/MmDT1s or you can use re.finditer, check for the existence of group 1 and concatenate the groups https://ideone.com/QMAyIY – The fourth bird Sep 07 '21 at 23:20
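
A sketch of both approaches mentioned in this comment. The pattern below is an assumption written for illustration, not necessarily the one behind the regex101 and ideone links; `PINNED_PATTERN`, `sample` and `labelled` are made-up names, and the third entry of the sample is invented to cover the case without a `pinned_field`.

```python
import re

# Assumed pattern: an optional non-capturing prefix holding group 1
# (the pinned field), followed by the quoted text in group 2.
PINNED_PATTERN = re.compile(r'(?:\bpinned_field:\s*(\w+)\s+)?\btext:\s*"([^"]+)"')

sample = '''[pinned_field: HEADLINE_1
text: "5 Best Products"
asset_performance_label: PENDING
, pinned_field: HEADLINE_1
text: "10 Best Products 2021"
asset_performance_label: PENDING
, text: "some_text_without_pinned_field"
asset_performance_label: PENDING
]'''

# re.findall returns one tuple per match; a group that did not
# participate in the match shows up as an empty string.
print(PINNED_PATTERN.findall(sample))
# [('HEADLINE_1', '5 Best Products'), ('HEADLINE_1', '10 Best Products 2021'),
#  ('', 'some_text_without_pinned_field')]

# With re.finditer an unmatched group is None, so the prefix can be
# added only when group 1 exists.
labelled = [
    f"{m.group(1)}: {m.group(2)}" if m.group(1) else m.group(2)
    for m in PINNED_PATTERN.finditer(sample)
]
print(labelled)
# ['HEADLINE_1: 5 Best Products', 'HEADLINE_1: 10 Best Products 2021',
#  'some_text_without_pinned_field']
```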
Thanks. Is there no need to add the `gm` flags? `g` corresponds to `findall`, but what about `m`? What makes it a multiline match? – Elad Benda Sep 07 '21 at 23:37
@EladBenda There is no `g` flag, re.findall already returns all matches. For multiline you can use `re.M` or [re.MULTILINE](https://docs.python.org/3/library/re.html#re.MULTILINE) but you don't have to use multiline as the pattern has no anchors `^` or `$` – The fourth bird Sep 07 '21 at 23:40
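
A short check of the claim that `re.MULTILINE` makes no difference here: the flag only changes where `^` and `$` match, and the answer's pattern uses neither. The `anchored` pattern is a made-up contrast case, not part of the answer.

```python
import re

pattern = r'\btext:\s*"([^"]+)"\n\s*asset_performance_label\b'
sample = 'text: "5 Best Products"\n    asset_performance_label: PENDING\n'

# No anchors in the pattern, so the flag changes nothing.
without_flag = re.findall(pattern, sample)
with_flag = re.findall(pattern, sample, re.MULTILINE)
print(without_flag == with_flag)  # True

# A pattern that starts with ^ does depend on the flag: without it,
# ^ only matches at the very start of the string.
anchored = r'^\s*asset_performance_label\b'
print(re.findall(anchored, sample))                # []
print(re.findall(anchored, sample, re.MULTILINE))  # ['    asset_performance_label']
```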
I see `gm` flags in your simulator, no? https://i.stack.imgur.com/kgyIx.png – Elad Benda Sep 08 '21 at 00:06
@EladBenda Yes, you can clear the m flag, the g flag is on to get all matches on regex101 (which re.findall already returns) https://regex101.com/r/GEh1LP/1 – The fourth bird Sep 08 '21 at 00:08
