perl - matching greater than character in regex

Question

$string1="<a href='/channels/folder1'>Alpha-Seeking";
$string2="<a href='/channels/folder2'>No Underlying Index ,";

I need to extract "Alpha-Seeking" and "No Underlying Index ," from the above 2 strings. Basically, I need everything from ('>) to the last character of the string.

I tried two ways:

1) The standard intuitive approach

if ($string1=~ /\'>(.*?)/) {print "got $1";}

but this does not seem to work on the '>' symbol.

2) Also tried

if ($string1=~ /(?=>)(.*?)/) {print "got $1";}

based on input from the question "Greater than and less than symbol in regular expressions", but it is not working either.

Any inputs will be useful.

PS: Also, if the answer can include matching the "less than" symbol ("<"), that will be great!

Thanks

@stevesliva ,.. Those quotes are clear. I modified them for posting this question. Have edited the original question to double-quotes. — Aquaholic, Feb 17 '20 at 15:31
What exactly do you mean about matching "<"? Can you give an example please? – pmqs, Feb 17 '20 at 15:39
https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Mark Reed, Feb 17 '20 at 17:02

Schwern · Accepted Answer · 2020-02-17T16:54:58.520

Do not parse HTML with a regex. Regexes are very bad at parsing complex, balanced text like HTML.

For example:

<tag>
  outer
  <tag>
    middle
    <tag>inner</tag>
    middle
  </tag>
  outer
</tag>

Instead, use an HTML parser and search tools such as XPath.

Here is a demonstration using XML::LibXML.

use strict;
use warnings;
use v5.10;

use XML::LibXML;

my $html = q{
<html>
<body>
    <a href='/channels/folder1'>Alpha-Seeking</a>
    <a href='/channels/folder2'>No Underlying Index</a>
</body>
</html>
};

# Parse the HTML
my $dom = XML::LibXML->load_html(string => $html);

# Find all links.
for my $node ($dom->findnodes('//a')) {
    # Print their text.
    say $node->textContent;
}
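The same approach can be tried on the unclosed fragments from the question. The following is only a sketch: it assumes XML::LibXML is installed and that libxml2's HTML parser repairs the missing closing tag, and the `@links` array is a name made up for this example.

```perl
use strict;
use warnings;
use v5.10;

use XML::LibXML;

# One of the fragments from the question: the <a> tag is never closed.
my $fragment = "<a href='/channels/folder1'>Alpha-Seeking";

# recover => 2 asks libxml2 to repair what it can without printing warnings.
my $dom = XML::LibXML->load_html(string => $fragment, recover => 2);

# Collect [href, text] pairs for every link (hypothetical variable name).
my @links = map { [ $_->getAttribute('href'), $_->textContent ] }
            $dom->findnodes('//a');

say "$_->[0] => $_->[1]" for @links;
```

Reading the attribute with `getAttribute` and the text with `textContent` means no pattern ever has to deal with the quote or the `>` character.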

Thanks @schwern, This works, though it needs some parsing for using HTML Parser. — Aquaholic, Feb 18 '20 at 10:36

score 3 · Answer 2 · answered Feb 17 '20 at 18:13

I must start by reiterating that it's incredibly unwise to parse HTML or XML with regexes. Please consider using a proper HTML parser.

Having said that, your problem here is pretty simple to fix. What you call the "standard intuitive approach" works fine with a simple tweak.

Here's what you have:

if ($string1=~ /\'>(.*?)/) {print "got $1";}

And your regex is \'>(.*?). That means "find a literal quote mark, followed by a greater-than sign, and then capture the minimum amount of anything following that". It's "the minimum amount" that's the problem. Nothing comes after the group in your pattern, so the smallest thing that .*? can capture is nothing at all: the empty string. The match succeeds, but $1 is empty.

Quantifiers such as * are greedy by default; they match as much as possible. You add the ? to remove that greediness and make them match as little as possible. But you don't want that here. Here, you want their greediness. So just remove that ?.

use warnings;
use strict;

my @strings = (
 "<a href='/channels/folder1'>Alpha-Seeking",
 "<a href='/channels/folder2'>No Underlying Index ,"
);

for my $string (@strings) {
  if ($string =~ /'>(.*)/) { # Note: No "?" here
    print "got $1\n";
  }
}

This displays:

got Alpha-Seeking
got No Underlying Index ,
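To see the difference side by side, here is a small sketch that runs the variants against the first string from the question. The helper name `capture` is made up for illustration. The last call also covers the "less than" part of the question: neither `<` nor `>` is a metacharacter in an ordinary pattern, so both match literally without escaping.

```perl
use strict;
use warnings;

# Hypothetical helper: return what the first capture group holds,
# or undef when the pattern does not match at all.
sub capture {
    my ($string, $regex) = @_;
    return $string =~ $regex ? $1 : undef;
}

my $string = "<a href='/channels/folder1'>Alpha-Seeking";

my $lazy     = capture($string, qr/'>(.*?)/);   # lazy, nothing after the group
my $anchored = capture($string, qr/'>(.*?)$/);  # lazy, but must reach the end
my $greedy   = capture($string, qr/'>(.*)/);    # greedy
my $tag      = capture($string, qr/<(\w+)/);    # literal "<", then the tag name

print "lazy:     [$lazy]\n";
print "anchored: [$anchored]\n";
print "greedy:   [$greedy]\n";
print "tag:      [$tag]\n";
```

The lazy variant does match, but its brackets come out empty, which is why the original attempt looked as if it had failed on the `>`.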

Thanks @davecross, this works, but HTML can be multi-line where this fails. +1 for single-line working — Aquaholic, Feb 18 '20 at 10:42
@Aquaholic: If you have more complicated specifications, then it's best to mention them in your question, otherwise you'll get answers that aren't very helpful. If you want to deal with multi-line data, then you'll need to specify what defines the end of the text. — Dave Cross, Feb 18 '20 at 11:21
Agreed. Just that in this case it turned out to be additional need as more data got exposed, after I posted this q. Will be mindful in future. — Aquaholic, Feb 18 '20 at 11:26

score 2 · Answer 3 · answered Feb 17 '20 at 15:37

This works for me

use warnings;
use strict;

my @strings = (
 "<a href='/channels/folder1'>Alpha-Seeking",
 "<a href='/channels/folder2'>No Underlying Index ,"
);

for my $string (@strings)
{
    if ($string =~ /'>(.*?)$/) 
    {
        print "got $1\n";
    } 
}

running it gives

$ perl /tmp/abc.pl
got Alpha-Seeking
got No Underlying Index ,
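The `$` anchor is what makes the lazy `.*?` stretch to the end of the string here. For text that spans several lines, a sketch along the following lines may help. The input strings are invented for this example, and stopping at the next `<` is an assumption about where the text ends, since the question does not define that.

```perl
use strict;
use warnings;

# Assumed input: link text that wraps onto a second line (not from the question).
my $string = "<a href='/channels/folder2'>No Underlying\nIndex ,";

# Without /s, "." does not match a newline, so ".*?" cannot reach the "$"
# at the end of the string and the match fails.
my $plain = $string =~ /'>(.*?)$/ ? $1 : undef;

# With /s, "." also matches newlines, so the capture runs to the end.
my $dotall = $string =~ /'>(.*?)$/s ? $1 : undef;

# Alternative: a negated class matches newlines too and stops at the next "<".
my $to_tag = "<a href='/x'>Alpha-\nSeeking</a>" =~ /'>([^<]*)/ ? $1 : undef;

print defined $plain ? "plain:  [$plain]\n" : "plain:  no match\n";
print "dotall: [$dotall]\n";
print "to_tag: [$to_tag]\n";
```

Once the end of the text is defined by markup rather than by the end of the string, an HTML parser is the more robust choice.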

pmqs

Thanks @pmqs, this works, but HTML can be multi-line where this fails. +1 for single-line working. – Aquaholic Feb 18 '20 at 10:36
@Aquaholic Agree, but your question suggested you were dealing with a single-line use-case :-) – pmqs Feb 18 '20 at 11:08

score 0 · Answer 4 · answered Feb 18 '20 at 10:40

While exploring various options, I managed to get this working with the following:

Replace the greater than sign with some other generic symbol (like a pipe)

$string=~ s/>/\|/g;                 #Interestingly, '>' matches here without any issues

After that, split on the pipe char, and print/parse the second part:

    ($o1,$o2) = split(/\|/, $string);
    print "$o2|";

Works perfectly as a work-around.
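For comparison, the substitution step is not strictly needed, because `split` can use the `>` directly. A minimal sketch, assuming the text of interest is everything after the first `>`:

```perl
use strict;
use warnings;

my $string = "<a href='/channels/folder1'>Alpha-Seeking";

# Split on the first ">" only; the limit of 2 keeps any later ">"
# inside the text instead of cutting it off.
my (undef, $text) = split />/, $string, 2;

print "got $text\n";
```

The limit argument matters if the text itself may contain a `>`; without it, only the piece up to the next `>` would end up in `$text`.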

Aquaholic

*Interestingly, '>' matches here without any issues* But '>' has **always** matched without an issue. The problem was never with the '>', it was with the `(.*?)`. I thought we had explained that. – Dave Cross Feb 18 '20 at 11:31
