Match first occurrence of string

Question

I have some HTML from which I need to remove the content of certain tags. There are around 30 of them, scattered through the HTML like this:

 <A class=tooltiplink href="javascript:void;" style="color:#000000"><img src="images/footnote.jpg" border="0"><SPAN style="margin:0 0 0 0px;"> unwanted info 4:6 </SPAN></A> 
<b>Hello </b>  
<A class=tooltiplink href="javascript:void;" style="color:#000000"><img src="images/footnote.jpg" border="0"><SPAN style="margin:0 0 0 0px;"> unwanted info 4:6 </SPAN>
</A><b>World</b>
<A class=tooltiplink href="javascript:void;" style="color:#000000"><img src="images/footnote.jpg" border="0"><SPAN style="margin:0 0 0 0px;"> unwanted info 4:6 </SPAN></A>

Desired output : Hello World

When I try to remove the tag content with $_=~s/A(.+)?\/A//gs; it also swallows the useful info that sits before the last tag. Removing the g modifier has the same effect. How can I remove only the tag content, without the first opening tag being matched against the last closing tag and the useful info being removed along with it?

Some info is needed to give a generic, exhaustive answer: 1) I assume this is only part of a bigger HTML document, maybe with several cases like your sample. 2) How could we define (by which criteria) that the to take as delimiter is the one corresponding to your 'first' ? 3) The unwanted info is what sits between the first pair of a block. 4) Can we assume that the block is ALWAYS on several lines, or could it also be on one line (and need correction in that case)? – NeronLeVelu, Dec 13 '13 at 09:26
I want to replace ALL pairs. Yes, it's inside a big HTML document with no new lines in between and . I don't get your 2nd assumption. @NeronLeVelu – xtreak, Dec 13 '13 at 09:38
Could you add a sample of your desired output for your sample (in the question, not in a comment, since formatting is missing there)? It's not clear to me what to keep and what to remove. – NeronLeVelu, Dec 13 '13 at 10:13
I added the output. I just want the Hello World inside the bold tags. @NeronLeVelu – xtreak, Dec 13 '13 at 10:21

score 2 · Answer 1 · answered Dec 13 '13 at 10:40

I think that while you could do this with a regex, it's not the best way to go. The likes of HTML::TreeBuilder and some XPath will give you a much more maintainable solution.

Once you've loaded the HTML into a tree structure, the XPath required might be as simple as:

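use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file("mypage.html");

# Select every <b> element; the unwanted <A> blocks are simply never picked up
my @nodes = $tree->findnodes('//b');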
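
As a fuller sketch of the same idea, the snippet below parses a string instead of a file and prints only the text of the bold elements. It assumes HTML::TreeBuilder::XPath is installed from CPAN; the sample string is a shortened copy of the question's markup, and the whitespace trimming and joining are illustrative choices rather than part of the original answer.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Shortened copy of the markup from the question (assumption: same structure)
my $html = <<'HTML';
<A class=tooltiplink href="javascript:void;"><img src="images/footnote.jpg"><SPAN> unwanted info 4:6 </SPAN></A>
<b>Hello </b>
<A class=tooltiplink href="javascript:void;"><img src="images/footnote.jpg"><SPAN> unwanted info 4:6 </SPAN>
</A><b>World</b>
<A class=tooltiplink href="javascript:void;"><img src="images/footnote.jpg"><SPAN> unwanted info 4:6 </SPAN></A>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

# Take the text of each <b> element and trim surrounding whitespace;
# the tooltip links are never selected, so their content is ignored
my @words = map { my $t = $_->as_text; $t =~ s/^\s+|\s+$//g; $t }
            $tree->findnodes('//b');

my $text = join ' ', @words;
print "$text\n";

$tree->delete;    # release the parse tree
```

For this sample the script should print Hello World. Selecting what you want to keep tends to be simpler than deleting what you don't, because the parser handles attributes, line breaks, and tag case for you.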

score 1 · Answer 2 · edited May 23 '17 at 12:03

Your problem is that the regex is greedy, i.e. it matches the longest possible substring (from the very first A to the very last /A). Try the non-greedy version of the quantifier (+? or *?):

$_=~s/A(.+?)?\/A//gs;

or

$_=~s/A(.*?)\/A//gs;
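
To see the difference side by side, here is a small sketch that runs the original greedy pattern and the non-greedy one on the same input. The sample string is a hypothetical, simplified stand-in for the question's markup.

```perl
use strict;
use warnings;

# Hypothetical sample with the same shape as the question's markup
my $html = '<A href="#">x</A><b>Hello </b><A href="#">y</A><b>World</b><A href="#">z</A>';

# Greedy: .+ runs from the first A to the last /A and takes the <b> text with it
(my $greedy = $html) =~ s/A(.+)?\/A//gs;

# Non-greedy: .*? stops at the nearest /A, so each block is removed on its own
(my $lazy = $html) =~ s/A(.*?)\/A//gs;

print "greedy: $greedy\n";    # greedy: <>
print "lazy:   $lazy\n";      # lazy:   <><b>Hello </b><><b>World</b><>
```

The non-greedy version keeps the bold text, but both leave stray <> pairs behind because the pattern never matches the angle brackets.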

By the way, where are the <> characters in your regex? Don't you want to find <A> rather than just A ?

You probably mean

$_=~s/\<A\>.*?\<\/A\>//gs;

See here: How can I write a regex which matches non greedy?

Comment: It's not a good idea to parse HTML with regular expressions, as too much can go wrong (e.g. with the above approach you do not find tags with spaces in them). Unless the exercise is meant to be a quick-and-dirty solution to an ad hoc problem, use an HTML parser!
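
If a quick-and-dirty regex is all you need, one possible variant that tolerates attributes in the opening tag is sketched below. It is an illustration, not a robust solution: the [^>]* part assumes no attribute value contains a > character, and the sample is a shortened copy of the question's markup.

```perl
use strict;
use warnings;

# Shortened copy of the markup from the question (assumption: same structure)
my $html = <<'HTML';
<A class=tooltiplink href="javascript:void;"><img src="images/footnote.jpg"><SPAN> unwanted info 4:6 </SPAN></A>
<b>Hello </b>
<A class=tooltiplink href="javascript:void;"><img src="images/footnote.jpg"><SPAN> unwanted info 4:6 </SPAN>
</A><b>World</b>
<A class=tooltiplink href="javascript:void;"><img src="images/footnote.jpg"><SPAN> unwanted info 4:6 </SPAN></A>
HTML

# <A\b[^>]*> matches the opening tag together with any attributes;
# .*? is non-greedy, /s lets . cross newlines, /i ignores tag case
(my $clean = $html) =~ s/<A\b[^>]*>.*?<\/A>//gis;

# Collapse the leftover whitespace for display
$clean =~ s/\s+/ /g;
$clean =~ s/^ | $//g;

print "$clean\n";    # <b>Hello </b> <b>World</b>
```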

answered Dec 13 '13 at 08:47 by JohnB · edited May 23 '17 at 12:03 by Community

How do I make it stop at the first matching substring? @JohnB – xtreak Dec 13 '13 at 08:48
Still I get the content inside link. Thanks I will try the HTML parser. But I don't want the content inside . So can HTML parser neglect content inside certain tags? @JohnB – xtreak Dec 13 '13 at 09:07
