1
$string1="<a href='/channels/folder1'>Alpha-Seeking";
$string2="<a href='/channels/folder2'>No Underlying Index ,";

I need to extract "Alpha-Seeking" and "No Underlying Index ," from the two strings above. Basically, I need everything after '> up to the last character of the string.

I tried two ways:

1) The standard, intuitive approach:

if ($string1=~ /\'>(.*?)/) {print "got $1";}

but this does not seem to work on the '>' symbol.

2) Also tried

if ($string1=~ /(?=>)(.*?)/) {print "got $1";} 

based on the answers to "Greater than and less than symbol in regular expressions", but it is not working either.

Any inputs will be useful.

PS: Also, if the answer can include matching the "less than" symbol ("<"), that will be great!

Thanks

zdim
Aquaholic

4 Answers

3

Do not parse HTML with a regex. Regexes are very bad at parsing complex, balanced text like HTML.

For example:

<tag>
  outer
  <tag>
    middle
    <tag>inner</tag>
    middle
  </tag>
  outer
</tag>

Instead, use an HTML parser and search tools such as XPath.
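
To make the nesting problem concrete, here is a minimal sketch in plain Perl (no modules) that runs a non-greedy pattern over the nested example above. The variable names are illustrative only; the point is that the pattern pairs the outermost opening tag with the innermost closing tag.

```perl
use strict;
use warnings;

# The nested example from above, built line by line.
my $nested = join "\n",
    '<tag>',
    '  outer',
    '  <tag>',
    '    middle',
    '    <tag>inner</tag>',
    '    middle',
    '  </tag>',
    '  outer',
    '</tag>';

# Non-greedy match: stops at the FIRST closing tag, which belongs
# to the innermost element, not to the opening tag we started from.
my ($lazy) = $nested =~ m{<tag>(.*?)</tag>}s;

# Count the tags left inside the capture to show it is unbalanced.
my $opens  = () = $lazy =~ m{<tag>}g;
my $closes = () = $lazy =~ m{</tag>}g;

print "captured text has $opens opening and $closes closing tags\n";
```

The capture contains two opening tags and no closing tags, so it is not a well-formed element. A greedy `.*` happens to work for this one input, but would run across sibling elements instead.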

Here is a demonstration using XML::LibXML.

use strict;
use warnings;
use v5.10;

use XML::LibXML;

my $html = q{
<html>
<body>
    <a href='/channels/folder1'>Alpha-Seeking</a>
    <a href='/channels/folder2'>No Underlying Index</a>
</body>
</html>
};

# Parse the HTML
my $dom = XML::LibXML->load_html(string => $html);

# Find all links.
for my $node ($dom->findnodes('//a')) {
    # Print their text.
    say $node->textContent;
}
Schwern
3

I must start by reiterating that it's incredibly unwise to parse HTML or XML with regexes. Please consider using a proper HTML parser.

Having said that, your problem here is pretty simple to fix. What you call the "standard intuitive approach" works fine with a simple tweak.

Here's what you have:

if ($string1=~ /\'>(.*?)/) {print "got $1";} 

And your regex is \'>(.*?). That means "find a literal quote mark, followed by a greater than sign and then capture the minimum amount of anything following that". It's "the minimum amount" that's the problem. The simplest thing that .*? can capture is nothing - the empty string.

Regex quantifiers are greedy by default; they match as much as possible. Adding the ? removes that greediness and makes them match as little as possible. But you don't want that here. Here, you want the greediness, so just remove that ?.

use warnings;
use strict;

my @strings = (
 "<a href='/channels/folder1'>Alpha-Seeking",
 "<a href='/channels/folder2'>No Underlying Index ,"
);

for my $string (@strings) {
  if ($string =~ /'>(.*)/) { # Note: No "?" here
    print "got $1\n";
  }
}

This displays:

got Alpha-Seeking
got No Underlying Index ,
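
As a side-by-side check, the following sketch runs three variants of the pattern against the first string. The hash names are only for illustration; the patterns themselves are the ones discussed on this page.

```perl
use strict;
use warnings;

my $string = "<a href='/channels/folder1'>Alpha-Seeking";

# Three variants: the original lazy pattern, the greedy fix,
# and the lazy pattern forced to extend by an end-of-string anchor.
my %pattern = (
    lazy          => qr/'>(.*?)/,
    greedy        => qr/'>(.*)/,
    lazy_anchored => qr/'>(.*?)$/,
);

my %captured;
for my $name (sort keys %pattern) {
    # In list context a successful match returns its capture groups.
    ($captured{$name}) = $string =~ $pattern{$name};
    printf "%s: [%s]\n", $name, $captured{$name};
}
```

All three patterns match, so the '>' is never the problem. The lazy one captures the empty string, while the greedy one and the anchored lazy one both capture Alpha-Seeking.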
Dave Cross
  • Thanks @davecross, this works, but HTML can be multi-line where this fails. +1 for single-line working – Aquaholic Feb 18 '20 at 10:42
  • @Aquaholic: If you have more complicated specifications, then it's best to mention them in your question, otherwise you'll get answers that aren't very helpful. If you want to deal with multi-line data, then you'll need to specify what defines the end of the text. – Dave Cross Feb 18 '20 at 11:21
  • Agreed. Just that in this case it turned out to be additional need as more data got exposed, after I posted this q. Will be mindful in future. – Aquaholic Feb 18 '20 at 11:26
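
For the multi-line case raised in these comments, one possible sketch is below. It assumes that the wanted text ends at the next "<" or at the end of the input, whichever comes first; that end condition is an assumption, not something stated in the question. The helper name text_after_tag is hypothetical. A negated character class such as [^<] also matches newlines, so no /s modifier is needed, and it shows that "<" needs no escaping either.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text after the first "'>" up to
# (but not including) the next "<", or up to the end of the input.
# Returns undef when "'>" does not occur at all.
sub text_after_tag {
    my ($html) = @_;
    return $html =~ /'>([^<]*)/ ? $1 : undef;
}

my $single = "<a href='/channels/folder1'>Alpha-Seeking";
my $multi  = "<a href='/channels/folder2'>No Underlying\nIndex ,</a>";

print "got ", text_after_tag($single), "\n";
print "got ", text_after_tag($multi),  "\n";
```

This is still a regex applied to HTML, so it only suits small, predictable fragments like the ones in the question.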
2

This works for me

use warnings;
use strict;

my @strings = (
 "<a href='/channels/folder1'>Alpha-Seeking",
 "<a href='/channels/folder2'>No Underlying Index ,"
);

for my $string (@strings)
{
    if ($string =~ /'>(.*?)$/) 
    {
        print "got $1\n";
    } 
} 

running it gives

$ perl /tmp/abc.pl
got Alpha-Seeking
got No Underlying Index ,
pmqs
  • Thanks @pmqs, this works, but HTML can be multi-line where this fails. +1 for single-line working. – Aquaholic Feb 18 '20 at 10:36
  • @Aquaholic Agree, but your question suggested you were dealing with a single-line use-case :-) – pmqs Feb 18 '20 at 11:08
0

While exploring various options, I managed to get this working with the following:

Replace the greater than sign with some other generic symbol (like a pipe)

$string=~ s/>/\|/g;                 #Interestingly, '>' matches here without any issues

After that, split on the pipe char, and print/parse the second part:

my ($o1, $o2) = split(/\|/, $string);   # $o1 is the tag part, $o2 the text after it
print "$o2|";

Works perfectly as a work-around.
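
A possibly simpler variant of the same idea is to split directly on '>' and skip the substitution. The sketch below passes a limit of 2 to split so that any later '>' or '|' in the text stays intact. The helper name text_after_gt is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: everything after the first ">" in the string.
sub text_after_gt {
    my ($string) = @_;
    # Limit 2: split only at the first ">", keep the rest as one field.
    my (undef, $text) = split />/, $string, 2;
    return $text;
}

print text_after_gt("<a href='/channels/folder1'>Alpha-Seeking"), "\n";
print text_after_gt("<a href='/channels/folder2'>No Underlying Index ,"), "\n";
```

Unlike the pipe substitution, this does not change the original string and does not break when the link text itself contains a pipe.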

Aquaholic
  • "Interestingly, '>' matches here without any issues": but '>' has always matched without an issue. The problem was never with the '>', it was with the (.*?). I thought we had explained that. – Dave Cross Feb 18 '20 at 11:31