I want to get an array of the values on all lines that start with `text:` (each one is followed by an `asset_performance_label` line).

I saw this post, but wasn't sure how to apply it.

Should I convert the proto to string, as I have tried?

    text = extract_text_from_proto(r"(\w+)text:(\w+)asset_performance_label:", '''[pinned_field: HEADLINE_1
    text: "5 Best Products"
    asset_performance_label: PENDING
    policy_summary_info
    {
        review_status: REVIEWED
        approval_status: APPROVED
    }
    , pinned_field: HEADLINE_1
    text: "10 Best Products 2021"
    asset_performance_label: PENDING
    policy_summary_info
    {
        review_status: REVIEWED
        approval_status: APPROVED
    }''')


def extract_text_from_proto(regex, proto_string):
    regex = re.escape(regex)
    result_array = [m.group() for m in re.finditer(regex, proto_string)]
    return result_array
    # return [extract_text(each_item, regex) for each_item in proto],


def extract_text(regex, item):
    m = re.match(regex, str(item))
    if m is None:
        # text = "MISSING TEXT"
        raise Exception("Ad is missing text")
    else:
        text = m.group(2)
    return text

Expected result: ["5 Best Products","10 Best Products 2021"]

What if I also want to match an optional `pinned_field: (word)`, so that the result could be `['HEADLINE_1: 5 Best Products', 'HEADLINE_1: 10 Best Products 2021', 'some_text_without_pinned_field']`?

Elad Benda

1 Answer

You can use a single capture group, and match asset_performance_label on the next line. Use re.findall to return the group values.

\btext:\s*"([^"]+)"\n\s*asset_performance_label\b

The pattern matches

  • \btext:\s*" Match text: preceded by a word boundary \b to prevent a partial match, then optional whitespace chars and an opening double quote
  • ([^"]+) Capture group 1, match 1+ chars other than a double quote
  • "\n\s* Match the closing double quote, a newline and optional whitespace chars
  • asset_performance_label\b Match `asset_performance_label` followed by a word boundary
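
To see what the leading \b contributes on its own, here is a small sketch. The field name `subtext:` and the sample string are invented for illustration and do not come from the original proto.

```python
import re

# Hypothetical input: "subtext:" is an invented field that merely ends in "text:".
sample = (
    'subtext: "ignored"\n'
    '    asset_performance_label: X\n'
    'text: "kept"\n'
    '    asset_performance_label: Y'
)

# With \b, "text:" must start at a word boundary, so "subtext:" is skipped.
with_boundary = re.findall(
    r'\btext:\s*"([^"]+)"\n\s*asset_performance_label\b', sample
)

# Without \b, the "text:" inside "subtext:" also matches.
without_boundary = re.findall(
    r'text:\s*"([^"]+)"\n\s*asset_performance_label\b', sample
)

print(with_boundary)     # ['kept']
print(without_boundary)  # ['ignored', 'kept']
```

The trailing \b works the same way at the other end: it stops the pattern from accepting a longer name that only begins with asset_performance_label.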

For example

import re

def extract_text_from_proto(regex, proto_string):
    return re.findall(regex, proto_string)

text = extract_text_from_proto(r'\btext:\s*"([^"]+)"\n\s*asset_performance_label\b', '''[pinned_field: HEADLINE_1
    text: "5 Best Products"
    asset_performance_label: PENDING
    policy_summary_info
    {
        review_status: REVIEWED
        approval_status: APPROVED
    }
    , pinned_field: HEADLINE_1
    text: "10 Best Products 2021"
    asset_performance_label: PENDING
    policy_summary_info
    {
        review_status: REVIEWED
        approval_status: APPROVED
    }''')


print(text)

Output

['5 Best Products', '10 Best Products 2021']
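
Note that this version drops the re.escape call from the question's code. A minimal sketch of why that call prevents any match: re.escape turns every regex metacharacter into a literal, so the escaped pattern only matches the pattern's own text. The variable names and the shortened sample below are assumptions for illustration.

```python
import re

pattern = r'\btext:\s*"([^"]+)"'
sample = 'text: "5 Best Products"'

# re.escape backslash-escapes metacharacters such as \, (, [, + and *.
escaped = re.escape(pattern)

matches_raw = re.findall(pattern, sample)       # pattern used as a regex
matches_escaped = re.findall(escaped, sample)   # pattern used as literal text
matches_literal = re.findall(escaped, pattern)  # only the pattern text itself matches

print(matches_raw)      # ['5 Best Products']
print(matches_escaped)  # []

# re.escape is meant for literal fragments that get embedded in a pattern,
# for example a field name held in a variable (hypothetical usage).
label = "asset_performance_label"
combined = r'\btext:\s*"([^"]+)"\n\s*' + re.escape(label) + r'\b'
matches_combined = re.findall(combined, 'text: "a"\n    asset_performance_label: X')
print(matches_combined)  # ['a']
```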
The fourth bird
  • why don't we need to use ` regex = re.escape(regex)`? – Elad Benda Sep 07 '21 at 21:03
  • What does `\b` mean? – Elad Benda Sep 07 '21 at 21:04
  • @EladBenda I will add a breakdown of the pattern. You can omit [re.escape](https://docs.python.org/3/library/re.html#re.escape) as we create the pattern and we already know what to escape. – The fourth bird Sep 07 '21 at 21:05
  • Thank you so much. What if I want to match an optional `pinned_field: (word)`, so that the result could be `['HEADLINE_1: 5 Best Products', 'HEADLINE_1: 10 Best Products 2021', 'some_text_without_pinned_field']`? – Elad Benda Sep 07 '21 at 21:43
  • @EladBenda You could get 2 capture groups, which will also be returned by re.findall `pinned_field: (.+)\n\s*text: "([^"]+)"` https://regex101.com/r/yBiByl/1 – The fourth bird Sep 07 '21 at 21:49
  • Thanks, but how will it handle cases when there is no `pinned_field`? Plus how will it concatenate the two groups when the first is optional? – Elad Benda Sep 07 '21 at 22:24
  • @EladBenda Then you could make the first group optional https://regex101.com/r/CFZXHp/1 and re.findall will return a list of tuples where the first one is empty if there is no group 1. See https://ideone.com/MmDT1s or you can use re.finditer and check for the existence of group 1 and concatenate the groups https://ideone.com/QMAyIY – The fourth bird Sep 07 '21 at 23:20
  • Thanks. Is there no need to add the `gm` flags? `g` is covered by `findall`, but what about `m`? What makes it a multiline match? – Elad Benda Sep 07 '21 at 23:37
  • @EladBenda There is no `g` flag, re.findall already returns all matches. For multiline you can use `re.M` or [re.MULTILINE](https://docs.python.org/3/library/re.html#re.MULTILINE) but you don't have to use multiline as the pattern has no anchors `^` or `$` – The fourth bird Sep 07 '21 at 23:40
  • I see `gm` flags in your simulator, no? https://i.stack.imgur.com/kgyIx.png – Elad Benda Sep 08 '21 at 00:06
  • @EladBenda Yes, you can clear the m flag, the g flag is on to get all matches on regex101 (which re.findall already returns) https://regex101.com/r/GEh1LP/1 – The fourth bird Sep 08 '21 at 00:08
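
The linked regex101 and ideone snippets are not reproduced in this thread. As a rough sketch of the optional-group approach described in the comments, the following assumes a pattern of my own (it may differ from the linked ones), a hypothetical helper name `extract_labeled_text`, and a shortened sample with a third invented entry that has no `pinned_field`.

```python
import re

# Assumed pattern: an optional non-capturing prefix holds group 1 (pinned_field),
# group 2 is the quoted text, which must be followed by asset_performance_label.
PATTERN = re.compile(
    r'(?:\bpinned_field:\s*(\w+)\s*\n\s*)?'
    r'\btext:\s*"([^"]+)"\n\s*asset_performance_label\b'
)

def extract_labeled_text(proto_string):
    """Return 'PINNED: text' when pinned_field is present, else just the text."""
    results = []
    for m in PATTERN.finditer(proto_string):
        pinned, text = m.group(1), m.group(2)
        # group(1) is None when the optional prefix did not take part in the match
        results.append(f"{pinned}: {text}" if pinned else text)
    return results

# Shortened sample; the last entry is invented to cover the missing-field case.
sample = '''[pinned_field: HEADLINE_1
    text: "5 Best Products"
    asset_performance_label: PENDING
    , pinned_field: HEADLINE_1
    text: "10 Best Products 2021"
    asset_performance_label: PENDING
    , text: "some_text_without_pinned_field"
    asset_performance_label: PENDING]'''

print(extract_labeled_text(sample))
# ['HEADLINE_1: 5 Best Products', 'HEADLINE_1: 10 Best Products 2021', 'some_text_without_pinned_field']

# re.findall returns tuples instead, with '' for the group that did not match.
print(PATTERN.findall(sample))
```

No flags are passed: the pattern has no `^` or `$` anchors, so `re.MULTILINE` would change nothing, and `finditer`/`findall` already return every match.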