All:

As the subject states, I'm running into an issue with grep's Perl-compatible (-P) non-greedy regex (.+?) matching on an empty string.

[Note: For the purposes of this example assume that the 'title' can be a complex, alpha-numeric, special-character, multi-word, space-separated, string.]

# echo "<span class=\"title\"></span><span class=\"price\">0.25</span><span class=\"title\">Banana</span><span class=\"price\">0.10</span><span class=\"title\">Grape</span><span class=\"price\">0.05</span>" | /opt/bin/grep -ioP "<span class=\"title\">(.+?)</span><span class=\"price\">(.+?)</span>" | sed "s/<span class=\"title\">//g; s/<span class=\"price\">/|/g; s/<\/span>//g;"
|0.25Banana|0.10
Grape|0.05

As you can see, the first 'title' match is empty, but the grep perl non-greedy scope regex (.+?) still matches.

Shouldn't the first 'title' match be ignored? What am I missing?

Thank you for your assistance.

UPDATE:

Negating the less-than sign ([^<]+?) is a good solution for the original, basic example. However, I'm finding that it runs into problems when more data is introduced.

I've attempted to expand the match to include additional trailing tags, but the regex appears to still be failing with that change as well.

# echo "<span class=\"title\"></span></div></div><span class=\"price\">0.25</span><span class=\"title\">Banana</span></div></a><span class=\"price\">0.10</span><span class=\"title\">Grape</span></div></a><span class=\"price\">0.05</span>" | grep -ioP "<span class=\"title\">(.+?)</span></div></a><span class=\"price\">(.+?)</span>" | sed "s/<span class=\"title\">//g; s/<span class=\"price\">/|/g; s/<\/span>//g; s/<\/div>//g; s/<\/a>//g;"
|0.25Banana|0.10
Grape|0.05

Shouldn't the regex match on the </span></div></a> tags, but not on the </span></div></div> tags?

Thanks, again, for your time and assistance.

Andy A.
  • [Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) – Cyrus May 29 '21 at 10:01
  • In the pattern you use `(.+?)`, where at least a single char must be matched due to the `+`. So it cannot match the empty span; instead it matches across the first closing span until a later one. You can change it to `.*?` – The fourth bird May 29 '21 at 10:02
  • The goal is to not match the group, if the 'title' or 'price' is empty. Changing the regex to .*? will enable empty strings to match, which is counter to what I'm attempting to accomplish. I appreciate your feedback. – Gary C. New May 29 '21 at 11:32
  • `([^<]+?)` might do it - if your title does not contain `<` characters... There are reasons why people advise you to not do it with regexes. – clamp May 29 '21 at 15:16

2 Answers

Your chosen regular expression <span class="title">(.+?)</span> requires at least one character inside the title tag. For the empty tag, the regex therefore starts capturing at that position, swallows the empty tag's own closing </span>, and keeps going until the next closing </span> that is followed by a price span. That is definitely not what you intended to achieve.

The following code demonstrates the problem:

use strict;
use warnings;

my $re = qr!<span class="title">(.+?)</span><span class="price">(.*?)</span>!;

my $input = do { local $/; <DATA> };
my %data = $input =~ /$re/g;

for my $k ( sort keys %data ) {
    printf "| %-10s | %6.2f |\n", $k, $data{$k};
}

__DATA__
<span class="title"></span><span class="price">0.25</span><span class="title">Banana</span><span class="price">0.10</span><span class="title">Grape</span><span class="price">0.05</span>

Output

| </span><span class="price">0.25</span><span class="title">Banana |   0.10 |
| Grape      |   0.05 |

Perhaps you intended to use the following regular expression, which does not let the title cross a tag boundary:

use strict;
use warnings;

my $re = qr!<span class="title">([^<]+?)</span><span class="price">(.*?)</span>!;

my $input = do { local $/; <DATA> };
my %data = $input =~ /$re/g;

for my $k ( sort keys %data ) {
    printf "| %-10s | %6.2f |\n", $k, $data{$k};
}

__DATA__
<span class="title"></span><span class="price">0.25</span><span class="title">Banana</span><span class="price">0.10</span><span class="title">Grape</span><span class="price">0.05</span>

Output

| Banana     |   0.10 |
| Grape      |   0.05 |

So, if you have chosen to stick with grep and sed, the command could take the following shape:

echo "<span class=\"title\"></span><span class=\"price\">0.25</span><span class=\"title\">Banana</span><span class=\"price\">0.10</span><span class=\"title\">Grape</span><span class=\"price\">0.05</span>" | grep -ioP "<span class=\"title\">([^<]+?)</span><span class=\"price\">(.+?)</span>" | sed "s/<span class=\"title\">//g; s/<span class=\"price\">/|/g; s/<\/span>//g;"

Output

Banana|0.10
Grape|0.05

If Perl is available on your system, it would perhaps be easier to use its power.

Polar Bear
  • What does he get? A cascade of vague hints on using Perl modules ("don't use regex for parsing XML") and two pages of Perl code with "perhaps the following code is self-explanatory"... Are you kidding? Résumé: the guy got the feeling that he had landed at some Area 51, with a bunch of aliens trying to tell him something in their own language, Perl. He removed the Perl tag and ran away. – Aditya May 29 '21 at 23:23
  • Well, then the guy needs to spend some time learning what a regular expression is and how to use that knowledge. I indicated in the first sentence _regular expression (.+?) which assumes a presence at least one symbol in title tag_ and explained what happens when it is used: _what leads regex to capturing from this place skipping empty tag until next closing tag_. If this language is too alien for him, he has the option to ask for an easier explanation. I guess that `grep` and `sed` are no less alien to him, yet he still elected to use them. – Polar Bear May 30 '21 at 00:41
  • @PolarBear Negating the less-than sign ([^<]+?) is a good solution for the original example. However, I'm finding that it runs into problems when more data is introduced. I've attempted to expand the match to include additional tags, but the regex appears to fail with that as well. Please refer to the update in my original post. Thank you for your suggestion and examples. – Gary C. New May 30 '21 at 01:55
  • @Gary C. New -- [Ruby](https://www.ruby-lang.org/en/documentation/), [python](https://docs.python.org/3/), [PHP](https://www.php.net/). – Polar Bear May 30 '21 at 04:44
  • @Gary C. New -- I am not saying that `grep`, `sed`, `awk` have no use; they are still very good tools, but for other tasks. Please read the following documents: [How to ask good question](https://stackoverflow.com/help/how-to-ask), [How to create a Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example). They will help you formulate what you are trying to achieve and give responders enough information to understand the problem and offer a better solution. – Polar Bear May 30 '21 at 04:47
  • @Gary C. New -- in your question it would be nice to mention something in the spirit of: _I captured some HTML code from a web page and would like to extract some information. I came up with the following solution, but my approach does not produce the expected result. Please advise what direction I should follow to achieve the desired result._ Also give a sample of the output, or maybe a data structure in a variable, as an example of the desired output or data. This should give enough information to work with; otherwise you will be asked for further information to fill the gaps. – Polar Bear May 30 '21 at 04:53
  • @Gary C. New -- In your question you do not mention where the data (HTML) originates. If it is a web page, you could put a reference to the page in the question. Then indicate what data is of interest and how the desired result should be presented (you gave a clue in your output sample, which is good), and whether you only want the output on the screen or want to process the data further, for example to save it into a file or a database. That gives an idea of the purpose of your attempt and of the best intermediate data format (hash, array, file). – Polar Bear May 30 '21 at 05:04
  • @PolarBear Success! With your guidance, I finally figured out the optimal solution for my particular issue, still making use of the original non-greedy scope regex match (.+?), which was to include additional leading tags that uniquely identified the specific groups I was targeting while excluding those that did not match. Appreciate your assistance and positive feedback. – Gary C. New May 31 '21 at 13:20

@PolarBear Success! With your guidance, I finally figured out the optimal solution for my particular issue, still making use of the original non-greedy scope regex match (.+?), which was to include additional leading tags that uniquely identified the specific groups I was targeting while excluding those that did not match. Appreciate your assistance and positive feedback.
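
For readers who want something concrete, here is a minimal sketch of that idea. The leading `<a><div>` and `<div><div>` wrappers are assumptions made up for illustration, since the actual leading tags are not shown in the question; only the closing tags are taken from the update. It also assumes a grep built with PCRE support (`-P`).

```shell
#!/bin/sh
# Hypothetical markup: the wanted groups are wrapped in <a><div>...</div></a>,
# the unwanted (empty-title) group in <div><div>...</div></div>.
html='<div><div><span class="title"></span></div></div><span class="price">0.25</span><a><div><span class="title">Banana</span></div></a><span class="price">0.10</span><a><div><span class="title">Grape</span></div></a><span class="price">0.05</span>'

# The leading <a><div> anchors where a match may start, so no match can
# begin at the empty title and (.+?) cannot run across it.
pattern='<a><div><span class="title">(.+?)</span></div></a><span class="price">(.+?)</span>'

extract() {
  # Print each match on its own line, turn the price tag into a separator,
  # then drop every remaining tag.
  grep -ioP "$pattern" | sed 's/<span class="price">/|/g; s/<[^>]*>//g'
}

printf '%s\n' "$html" | extract
# Banana|0.10
# Grape|0.05
```

The non-greedy `(.+?)` is unchanged; what changes is where a match is allowed to begin. An empty title inside a group that does carry the anchoring prefix would still be skipped over in the same way as before, so this relies on the prefix being unique to the groups you want.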

  • Gary -- I suggest looking at the following [document](https://www.regular-expressions.info/lookaround.html), particularly the sections on lookahead and lookbehind regex patterns. Perhaps this _Stack Overflow_ [question](https://stackoverflow.com/questions/2973436/regex-lookahead-lookbehind-and-atomic-groups) will also provide some useful information that can be applied to your case. – Polar Bear May 31 '21 at 19:49
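
One way the lookahead idea could be applied to the original example, offered as a rough sketch rather than a definitive fix: replace `.+?` with `(?:(?!</span>).)+`, which consumes one character at a time only while a closing `</span>` does not start at that position. Unlike `[^<]+?`, this still allows a `<` or a nested tag inside the title. The `<b>`-wrapped title below is made-up sample data, and a grep with PCRE support (`-P`) is assumed.

```shell
#!/bin/sh
# Sample data based on the question, with one hypothetical title containing a nested tag.
html_sample='<span class="title"></span><span class="price">0.25</span><span class="title">Banana</span><span class="price">0.10</span><span class="title"><b>Grape</b></span><span class="price">0.05</span>'

# (?:(?!</span>).)+ : one or more characters, none of which begins a "</span>".
# An empty title cannot satisfy the "+", and the group can never cross a closing span.
tempered='<span class="title">((?:(?!</span>).)+)</span><span class="price">((?:(?!</span>).)+)</span>'

extract_tempered() {
  grep -ioP "$tempered" | sed 's/<span class="title">//g; s/<span class="price">/|/g; s/<\/span>//g'
}

printf '%s\n' "$html_sample" | extract_tempered
# Banana|0.10
# <b>Grape</b>|0.05
```

The empty title is skipped because the very first character after the opening tag already starts a `</span>`, so the group cannot match even once.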