I'm struggling here, trying to figure out how to replace all double slashes that come after a specific word.

Example:

<img alt="" src="/pt/webf//2015//47384_1.JPG" height="235" width="378" />
<div>Don't remove this // or this//</div>

I want the string above to look like this:

<img alt="" src="/pt/webf/2015/47384_1.JPG" height="235" width="378" />
<div>Don't remove this // or this//</div>

Notice the double slashes have been replaced with just one slash in the img tag but left unscathed in the div tag. I only want to replace the double slashes IF they come after the word: pt.

I tried something like this:

(?=pt)((.*?)\/\/)+ 

However, the first thing wrong with it is (?=) does not do pattern backtracking, as far as I'm aware. That is, it'll only look for the first matching pattern. The second thing wrong with it is it doesn't work as I intended it to.

https://regex101.com/r/kC4tA5/1

Or maybe I'm going about this the wrong way. Regular expression support in VBScript/Classic ASP is limited, so should I break up the string and process the pieces instead of trying to do everything in one regular expression?

Any help would be appreciated.

Thank you.

user3621633
  • Where does the broken HTML come from? Can the source be fixed? – Tomalak Oct 15 '15 at 16:19
  • It's part of a VBScript that reads in snippets of certain HTML tags from a large batch of files (I didn't write the script). I could correct the actual file, but the files are created by users, so this could pop up again and again, which is why I'm trying to work around user ID10T errors. There may be multiple files like this, in fact. I've only found one so far. Maybe I'm better off using VBScript to break up the snippet, apply the regex, and then put it back together. Is that safe to say? – user3621633 Oct 15 '15 at 16:24
  • I think it's working as intended, with the catch being that you'll only ever capture the last iteration, per the note in the "explanation" pane: `Note: A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data` – Jeff Y Oct 15 '15 at 16:39
  • I couldn't find any way to do this with a single regexp either. I think you'll need to do something like `if /pt/ then s/\/{2,}/\//g` (pseudocode). – Jeff Y Oct 15 '15 at 16:48
  • It's generally not at all advisable to run regex over HTML. Regex is technically incapable of parsing HTML; pain and despair lie down that road. Usually you would use a parser to pick apart HTML. – Tomalak Oct 15 '15 at 16:53

1 Answer

I am interpreting your issue as "Removing repeated slashes in all <img src> attributes."

As I said in the comments, working with HTML requires a parser. HTML is too complex for regular expressions, and all kinds of things can go wrong.

Luckily, there is a parser available to VBScript: the htmlfile object. It creates a standard DOM from your HTML string. So the solution becomes exactly as described: find every img element and collapse the repeated slashes in its src.

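Function FixHtml(htmlString)
    Dim doc, img, slashes

    ' Matches any run of one or more slashes
    Set slashes = New RegExp
    slashes.Pattern = "/+"
    slashes.Global = True

    ' Parse the HTML string into a DOM
    Set doc = CreateObject("htmlfile")
    doc.Write htmlString

    For Each img In doc.getElementsByTagName("IMG")
        ' Collapse each run of slashes in the src into a single slash
        img.src = slashes.Replace(img.src, "/")
        ' Strip the "about:" / "about:blank" prefix that htmlfile prepends
        img.src = Replace(Replace(img.src, "about:blank", ""), "about:", "")
    Next

    ' Serialize the repaired DOM back to an HTML string
    FixHtml = doc.body.innerHTML
End Function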
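
To see what the slash-collapsing step does on its own, here is a minimal sketch that applies the same kind of RegExp replacement to plain strings, outside the DOM. The helper name `CollapseSlashes` and the `example.com` URL are made up for illustration, and `WScript.Echo` assumes the script runs under Windows Script Host (in Classic ASP you would use `Response.Write` instead).

```vbscript
Option Explicit

' Hypothetical helper: replaces every run of two or more slashes with one.
Function CollapseSlashes(url)
    Dim re
    Set re = New RegExp
    re.Pattern = "/{2,}"
    re.Global = True
    CollapseSlashes = re.Replace(url, "/")
End Function

' Root-relative path, as in the question
WScript.Echo CollapseSlashes("/pt/webf//2015//47384_1.JPG")
' prints /pt/webf/2015/47384_1.JPG

' Caveat: the slashes after a URL scheme are collapsed as well
WScript.Echo CollapseSlashes("http://example.com//a.jpg")
' prints http:/example.com/a.jpg
```

The second call shows a limitation worth keeping in mind: if the snippets can contain absolute URLs, the `//` after the scheme would need to be protected before applying a replacement like this.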

Unfortunately, htmlfile is not the most advanced HTML parser in the world, but rest assured that it will still do way better than any regex.

There are two minor issues:

  1. I found in my tests that for some reason it insists on prepending the img.src with about: or about:blank. This should not happen, but it does. The second line inside the loop, with its nested Replace() calls, gets rid of the unwanted additions.

  2. The .innerHTML will produce tag names in upper case, so <img> becomes <IMG> in the output. Also, insignificant line breaks in the HTML source might be removed. This is a minor annoyance; I recommend you don't obsess over it.(*)

But there are two big plus sides as well:

  1. The DOM puts you in a position where you can work with the input in a structured way. You can put in any number of complex fixes now that would have been impossible to do with regex.
  2. The return value of .innerHTML is sane HTML. It will fix any gross blunder in the input and turn it into something that is well-nested, well-escaped and otherwise well-behaved.

(*) If you do find yourself obsessing over it, you can use the wisdom from this blog post to create a function that replaces all uppercase tags that come out of .innerHTML with lowercase versions of themselves. This actually is something you can use regex for ("(</?[A-Z]+)", to be exact), because we know that there will be no stray < not belonging to a tag anywhere in the string, because that's .innerHTML's guarantee. While it would be a nice exercise (and it introduces you to the little-known fact that VBScript has function pointers), I would say it's not really worth it.
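
For illustration, here is a minimal sketch of such a lowercasing function. It deliberately does not use the function-pointer technique from the blog post; it walks the regex matches in a plain loop instead. The name `LowerCaseTags` is hypothetical, and the sketch relies on the assumption stated above that every `<` in the input starts a tag.

```vbscript
Option Explicit

' Hypothetical helper: lowercases uppercase tag names such as <IMG> or </DIV>.
Function LowerCaseTags(html)
    Dim re, m, result, nextPos
    Set re = New RegExp
    re.Pattern = "</?[A-Z]+"   ' IgnoreCase defaults to False, so only uppercase names match
    re.Global = True

    result = ""
    nextPos = 1                ' one-based position of the first character not yet copied
    For Each m In re.Execute(html)
        ' FirstIndex is zero-based, Mid() is one-based
        result = result & Mid(html, nextPos, m.FirstIndex + 1 - nextPos) & LCase(m.Value)
        nextPos = m.FirstIndex + 1 + m.Length
    Next

    ' Append whatever follows the last match
    LowerCaseTags = result & Mid(html, nextPos)
End Function

WScript.Echo LowerCaseTags("<IMG src=""/A.JPG""><DIV>Text</DIV>")
' prints <img src="/A.JPG"><div>Text</div>
```

Attribute values and text content keep their original case, which is the difference from calling `LCase()` on the whole string.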

Tomalak
  • This is actually quite outstanding. Thank you. The most difficult part is it won't always be `img`. But I guess it's not so bad. Just have to comb through the files and look for all tag types which are applicable and modify your function above, as needed. Thanks again. – user3621633 Oct 15 '15 at 17:28
  • You can use `.getElementsByTagName("*")` to get all elements. There are ways to [iterate the attributes](http://stackoverflow.com/questions/828311/how-to-iterate-through-all-attributes-in-an-html-element) as well. Have a look at the footnote I just added. – Tomalak Oct 15 '15 at 17:36
  • Thank you for the blog post reference. The script (again, not my handiwork) actually uses the `lcase` VBScript function on each snippet that is applicable for processing. So, everything becomes lowercase. But in general, that blog post topic can be mighty helpful. Thanks again. – user3621633 Oct 15 '15 at 17:40
  • So you `LCase()` the entire user input unconditionally? Well, if that's all right, you can of course simply `LCase()` the `.innerHTML`. – Tomalak Oct 15 '15 at 17:42
  • Yes, that's correct. For this particular task, I'm just gluing, not rewriting the script. Yeah, `LCase()` on `.innerHTML` should suffice for this task. But in general, that blog post reference would be how I'd do it, if I was rewriting the script. Thanks so much for your efforts! – user3621633 Oct 15 '15 at 17:49
  • You're welcome. Besides, people who actually listen to "don't use regex on HTML" are more than worth the effort. – Tomalak Oct 15 '15 at 17:51