Tcl: Regsub does not substitute a string while parsing an HTML snippet

Question

I'm trying to find a specific string within an array element. Since each array element is a string that can contain multiple occurrences of that string, I repeatedly substitute on the result, removing one occurrence at a time. The algorithm works on a simple example, but when I use it with HTML (which is the purpose of the program) it gets stuck in an infinite while loop.

Here is an (ugly) expression that I'm using:

set expression {\<div\sclass\=\"fileText\"\sid\=\"[^\"]+\"\>File\:\s\<a\s(title\=\"[^\"]+\"\s)?href\=\"([^\"]+)\"\starget\=\"\_blank\"\>([^\<]+)\<\/a\>[^\<]+\<\/div\>};

Here is an element of the array from which I want to extract strings (it contains 2 occurrences of the given expression):

set htmlForParse(0) {file" id="f51456520"><div class="fileText" id="fT51456520">File: <a href="//example.com" target="_blank">48912-arduinouno_r3_front.jpg</a> (1022 KB, 1800x1244)</div><a class="fileThumb" href="//example.com" target="_blank"><img " title="Reply to this post">YesNo?</a></span></div><div class="file" id="f51456769"><div class="fileText" id="fT51456769">File: <a href="//example.com" target="_blank">892991578.jpg</a> (32 KB, 400x422)</div><a class="fileThumb" href="//example.com" target="_blank"><img src};

And here are the loops that I'm using to achieve this:

for {set k 0} {$k < [array size htmlForParse]} {incr k} {
    while {[regexp $expression $htmlForParse($k) exString]} {
        regsub -- $exString $htmlForParse($k) {} htmlForParse($k);
        puts $htmlForParse($k);
    }
}

The purpose of the regsub is to substitute one hit from regexp at a time, until no hits are left and regexp returns 0. At that point the while loop should finish, and the next element of the array can be examined. But that doesn't happen: it continues to loop forever, and it seems that regsub does not substitute the found string with an empty string (nor with anything else). Why?

obligatory "don't parse html with regex" link: http://stackoverflow.com/a/1732454/7552 — glenn jackman, Nov 22 '15 at 19:37

score 2 · Accepted Answer · answered Nov 22 '15 at 16:38

The problem is that the string you are matching contains unquoted RE metacharacters. The ones I notice are parentheses (around the sizes):

% regexp $expression $htmlForParse($k) exString
1
% puts $exString
<div class="fileText" id="fT51456520">File: <a href="//example.com" target="_blank">48912-arduinouno_r3_front.jpg</a> (1022 KB, 1800x1244)</div>

This means that the substring you extract doesn't actually match as a regular expression in the regsub, and no change is made. Next time round the loop, you get to match everything exactly as it was once again. Not what you want!
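A minimal sketch of that failure, using a shortened, hypothetical stand-in for the real data (the variable names text and result are made up for this illustration):

```tcl
# Hypothetical, shortened sample; not the asker's real input.
set text     {<a>pic.jpg</a> (32 KB, 400x422)</div>}
set exString {</a> (32 KB, 400x422)</div>}

# As a plain string, exString is present in the text (index is >= 0).
puts [string first $exString $text]

# As an RE, the parentheses form a capturing group, so the pattern
# looks for "</a> 32 KB, 400x422</div>" with no literal parentheses.
puts [regexp -- $exString $text]

# regsub therefore makes no substitution and copies the text unchanged.
set count [regsub -- $exString $text {} result]
puts "$count substitutions"
```

Here regexp and regsub should both report 0, even though string first finds the substring.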

The easiest fix is to tell the regsub that the string it is using as a pattern is a literal string. This is done by preceding the RE with ***=, like this:

while {[regexp $expression $htmlForParse($k) exString]} {
    regsub -- ***=$exString $htmlForParse($k) {} htmlForParse($k)
    puts $htmlForParse($k)
}

With your sample text, this will perform two replacements. I hope that's what you want.
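If the goal is only to collect or remove every match, one possible alternative is to skip the extracted substring entirely and reuse the original pattern with the -all option. The sketch below assumes a short, hypothetical pattern and input (pattern, text and names are illustrative, not the asker's variables):

```tcl
# Hypothetical pattern and data, kept short for illustration.
set pattern {<b>([^<]+)</b>}
set text    {x <b>one (1)</b> y <b>two (2)</b> z}

# Collect every match in one pass; with one capture group, the
# result list alternates between the whole match and the group.
set names {}
foreach {whole name} [regexp -all -inline -- $pattern $text] {
    lappend names $name
}
puts $names

# Remove all matches at once; regsub returns the substitution count.
set count [regsub -all -- $pattern $text {} text]
puts "$count removed, remaining: $text"
```

Because the pattern itself is reused, nothing extracted from the data is ever interpreted as an RE, so metacharacters in the matched text cannot cause the loop to stall.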

Also, your initial RE has far too many backslashes in it. None of /, < and > are RE metacharacters. It's not harmful to quote them, but I hope you are generating that RE from something, not writing it by hand!
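As a rough illustration of that point, here is the same expression with the unnecessary backslashes removed, so that only the \s escapes remain. The sample string is a shortened, hypothetical stand-in for the real data, and the names verbose, plain and sample are made up for this sketch:

```tcl
# The expression from the question, as originally written.
set verbose {\<div\sclass\=\"fileText\"\sid\=\"[^\"]+\"\>File\:\s\<a\s(title\=\"[^\"]+\"\s)?href\=\"([^\"]+)\"\starget\=\"\_blank\"\>([^\<]+)\<\/a\>[^\<]+\<\/div\>}

# The same expression without the redundant backslashes.
set plain {<div\sclass="fileText"\sid="[^"]+">File:\s<a\s(title="[^"]+"\s)?href="([^"]+)"\starget="_blank">([^<]+)</a>[^<]+</div>}

# Hypothetical, shortened sample input.
set sample {<div class="fileText" id="fT1">File: <a href="//example.com" target="_blank">a.jpg</a> (32 KB, 400x422)</div>}

# Both forms should match the sample.
puts [regexp -- $verbose $sample]
puts [regexp -- $plain $sample]

# Capture groups: optional title attribute, href value, link text.
regexp -- $plain $sample whole title href name
puts "href=$href name=$name"
```

Brace quoting keeps the double quotes literal, so they need no escaping either at the Tcl level or at the RE level.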

Thank you very much. I didn't know about `***=`. I wrote that RE by hand; I've just started to learn them. Originally there were only a few backslashes, but I added more, thinking that maybe that was the problem. — Gitnik, Nov 22 '15 at 16:53
The Tcl documentation is _dense_; it's very easy to miss things in it (or, more usually, to miss the implications of things that it says). — Donal Fellows, Nov 23 '15 at 15:23
