
I have some HTML from which I need to remove certain tags together with their content. There are around 30 of them, scattered in various places inside the HTML, like this:

 <A class=tooltiplink href="javascript:void;" style="color:#000000"><img src="images/footnote.jpg" border="0"><SPAN style="margin:0 0 0 0px;"> unwanted info 4:6 </SPAN></A> 
<b>Hello </b>  
<A class=tooltiplink href="javascript:void;" style="color:#000000"><img src="images/footnote.jpg" border="0"><SPAN style="margin:0 0 0 0px;"> unwanted info 4:6 </SPAN>
</A><b>World</b>
<A class=tooltiplink href="javascript:void;" style="color:#000000"><img src="images/footnote.jpg" border="0"><SPAN style="margin:0 0 0 0px;"> unwanted info 4:6 </SPAN></A>

Desired output: Hello World

When I try to remove the tag content with $_=~s/A(.+)?\/A//gs; it also removes the useful info that sits before the last tag. Removing the g modifier has the same effect. How can I remove only each tag and its content, without the first opening tag matching all the way to the last closing tag and taking the useful info with it?

xtreak
  • Do you have some example input and required output? – chooban Dec 13 '13 at 08:44
  • I have posted the sample code. I need only Hello and World. – xtreak Dec 13 '13 at 08:48
  • Some info is needed to give a generic, exhaustive answer: 1) I assume this is only part of a bigger HTML document, which may contain several cases like your sample. 2) How could we define (by which criteria) that the to take as delimiter is the one corresponding to your 'first' ? 3) The unwanted info is whatever lies between the first pair of a block. 4) Can we assume that the block ALWAYS spans several lines, or could it also be on one line (and need correcting in that case)? – NeronLeVelu Dec 13 '13 at 09:26
  • I want to replace ALL pairs. Yes, it's inside a big HTML document with no newlines in between and . I don't understand your 2nd point. @NeronLeVelu – xtreak Dec 13 '13 at 09:38
  • Could you add a sample of your desired output for your sample (in the question, not in a comment, since formatting is lost there)? It's not clear to me what to keep and what to remove. – NeronLeVelu Dec 13 '13 at 10:13
  • I added the output. I just want the Hello World inside the bold tags. @NeronLeVelu – xtreak Dec 13 '13 at 10:21
  • I have posted the code and the output. @chooban – xtreak Dec 13 '13 at 10:21

2 Answers


I think that while you could do this with a regex, it's not the best way to go. Something like HTML::TreeBuilder together with some XPath will give you a much more maintainable solution.

Once you've loaded the HTML into a tree structure, the XPath required might be as simple as:

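use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file("mypage.html");

# findnodes returns every element matching the XPath expression
my @nodes = $tree->findnodes('//b');
print $_->as_text, "\n" for @nodes;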
chooban

Your problem is that the regex is greedy, i.e. it matches the longest possible substring (from the very first A to the very last /A). Try the non-greedy version of the + operator:

$_=~s/A(.+?)?\/A//gs;

or

$_=~s/A(.*?)\/A//gs;
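To see the difference, here is a minimal sketch on a made-up string. The input and the variable names are only illustrative (they are not from the question), and the angle brackets are written out so that whole tags are matched:

```perl
use strict;
use warnings;

# Hypothetical input: two <A>...</A> blocks with useful text between them.
my $html = '<A>junk 1</A><b>Hello</b><A>junk 2</A>';

# Greedy: .+ runs from the first <A> to the LAST </A>,
# so the useful <b>Hello</b> is swallowed as well.
(my $greedy = $html) =~ s/<A>.+<\/A>//gs;

# Non-greedy: .+? stops at the NEAREST </A>, so each block
# is removed separately and the text in between survives.
(my $lazy = $html) =~ s/<A>.+?<\/A>//gs;

print "greedy: '$greedy'\n";   # greedy: ''
print "lazy:   '$lazy'\n";     # lazy:   '<b>Hello</b>'
```

Dropping the g modifier does not help in the greedy case, because the single greedy match already covers everything from the first opening tag to the last closing tag.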

By the way, where are the <> characters in your regex? Don't you want to find <A> rather than just A?

You probably mean

$_=~s/\<A\>.*?\<\/A\>//gs;

See here: How can I write a regex which matches non greedy?

Comment: It's not a good idea to parse HTML with regular expressions, as too much can go wrong (e.g. with the above approach you do not find tags with spaces in them). Unless the exercise is meant to be a quick-and-dirty solution to an ad hoc problem, use an HTML parser!
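For the quick-and-dirty case, here is a minimal sketch of a non-greedy pattern that also tolerates attributes inside the opening tag, run on a shortened version of the sample from the question. The pattern and the variable names are assumptions made for illustration. It still breaks on nested <A> tags or on a > inside an attribute value, which is exactly the kind of case a parser handles for you.

```perl
use strict;
use warnings;

# Shortened version of the sample from the question.
my $html = <<'HTML';
<A class=tooltiplink href="javascript:void;"><img src="images/footnote.jpg"><SPAN> unwanted info 4:6 </SPAN></A>
<b>Hello </b>
<A class=tooltiplink href="javascript:void;"><img src="images/footnote.jpg"><SPAN> unwanted info 4:6 </SPAN>
</A><b>World</b>
HTML

# <A\b     tag name followed by a word boundary (so <ABBR> is left alone)
# [^>]*>   any attributes, up to the end of the opening tag
# .*?      as little as possible (with the s modifier, . also matches newlines)
# </A\s*>  the nearest closing tag
(my $clean = $html) =~ s{<A\b[^>]*>.*?</A\s*>}{}gis;

# Drop the remaining tags and squeeze whitespace to get plain text.
(my $text = $clean) =~ s/<[^>]+>//g;
$text =~ s/\s+/ /g;
$text =~ s/^ | $//g;

print "$text\n";   # Hello World
```

The i modifier makes the pattern match <a ...> as well as <A ...>. If you only want to delete the tooltip links and keep the rest of the markup, stop after the first substitution and use $clean.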

JohnB