I'm currently working on a project to strip all unnecessary HTML. I've got it all working, but I'm using the following code to replace double spaces:

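Private Function stripDubbleSpace(ByVal fileContent As String) As String
    ' Keep replacing until no run of two spaces is left
    While fileContent.IndexOf("  ") <> -1
        fileContent = fileContent.Replace("  ", " ")
    End While
    ' The loop only exits once no double spaces remain, so no extra Replace is needed here
    Return fileContent
End Function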

The code above works, but within an HREF or SRC attribute the URL will lead to a 404 once a double space is replaced by a single space. Don't ask me why there are spaces in my URLs; I'm aware that's not the best way.

Example:
/images/my  img.jpg (2 spaces) would be replaced by /images/my img.jpg (1 space), which should not happen.

How can I replace double spaces only when they are not within an HREF or SRC attribute?

Niels
  • Or in ` ` or ` ` – Rawling Dec 19 '12 at 15:21
  • The Agility Pack makes it easy to find elements, but can you also minify your HTML with that library? – Niels Dec 19 '12 at 15:41

3 Answers

Your code for replacing double spaces with a single space doesn't actually use a regexp. If you want a regexp, it would look like the following (JavaScript syntax; note that \s also matches tabs and line breaks, not just spaces):

myurl = myurl.replace(/\s{2,}/g, ' ');

The next step is to expand the regexp above so that it detects HREF and SRC attributes and skips them.
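Since the question is written in VB.NET, a rough equivalent of the one-liner above might look like this. It is only a sketch: the module and function names are made up for illustration, and the pattern uses a literal space rather than \s so that it behaves like the loop in the question. It still collapses spaces inside attribute values, so on its own it does not solve the HREF/SRC problem.

```vbnet
Imports System.Text.RegularExpressions

Module CollapseSpacesDemo
    ' Hypothetical helper: collapses every run of two or more space
    ' characters into a single space. Attribute values are NOT skipped.
    Function CollapseSpaces(ByVal fileContent As String) As String
        Return Regex.Replace(fileContent, " {2,}", " ")
    End Function

    Sub Main()
        ' Prints: a b c
        Console.WriteLine(CollapseSpaces("a   b  c"))
    End Sub
End Module
```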

bonCodigo
  • I'm aware I'm currently not using a regex, but I'm looking for a solution that does not replace double spaces in SRC or HREF. – Niels Dec 19 '12 at 15:39
  • Thanks for the information, but you are at the exact same spot where I am at the moment. How to expand that regex is exactly where my problem lies. – Niels Dec 19 '12 at 15:43
  • Niels, sorry for getting back to you late. Perhaps I completely overlooked the latter part. Most `HREF SRC` regexes seem to end up as truly ugly, long patterns, so [we shouldn't really parse HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). I believe you are better off with what @Rawling mentioned: [Html Agility Pack](http://stackoverflow.com/questions/4835868/how-to-get-img-src-or-a-hrefs-using-html-agility-pack) – bonCodigo Dec 19 '12 at 19:55

Use the Html Agility Pack. Regex is not powerful enough to parse HTML with its nested structures, or at least you end up with hopelessly complicated regex expressions.
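As a minimal sketch of what that could look like (not a tested minifier): the idea is to collapse spaces only in text nodes, so attribute values such as HREF and SRC are never touched. It assumes the HtmlAgilityPack package is referenced; the module and function names are hypothetical, and text inside elements like pre or textarea is not excluded here, which a real implementation would probably want to handle.

```vbnet
Imports System.Text.RegularExpressions
Imports HtmlAgilityPack

Module AgilityPackDemo
    ' Hypothetical helper: collapses runs of spaces in text nodes only.
    Function CollapseTextSpaces(ByVal html As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing when no node matches the XPath.
        Dim textNodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//text()")
        If textNodes IsNot Nothing Then
            For Each node As HtmlNode In textNodes
                Dim textNode As HtmlTextNode = TryCast(node, HtmlTextNode)
                If textNode IsNot Nothing Then
                    ' Attribute values live on element nodes, so they stay as they are.
                    textNode.Text = Regex.Replace(textNode.Text, " {2,}", " ")
                End If
            Next
        End If

        Return doc.DocumentNode.OuterHtml
    End Function
End Module
```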

Olivier Jacot-Descombes

In the end I did not want to use a parser, since that would take a lot more time for just this one function. My final solution was to find all attributes of the form KEY="VALUE" and replace the spaces within them with a placeholder tag. Then I replace all double spaces with a single space, and finally replace the placeholder tag with a space again. The attributes keep their spaces and I don't need a library.
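The steps above could be sketched roughly as follows. This is a reconstruction, not the exact code used: the placeholder string, the attribute pattern, and all names are assumptions for illustration. The placeholder must never occur in the real input, and the pattern will also match KEY="VALUE" pairs that appear in plain text outside a tag.

```vbnet
Imports System.Text.RegularExpressions

Module StripSpacesDemo
    ' Hypothetical placeholder tag; assumed never to occur in the input.
    Private Const Placeholder As String = "[[SP]]"

    Function StripDoubleSpaces(ByVal fileContent As String) As String
        ' 1. Protect spaces inside key="value" and key='value' pairs.
        Dim attributePattern As String = "[\w-]+\s*=\s*(""[^""]*""|'[^']*')"
        Dim protectedContent As String = Regex.Replace(
            fileContent, attributePattern,
            Function(m As Match) m.Value.Replace(" ", Placeholder))

        ' 2. Collapse every remaining run of spaces into a single space.
        Dim collapsed As String = Regex.Replace(protectedContent, " {2,}", " ")

        ' 3. Restore the protected spaces.
        Return collapsed.Replace(Placeholder, " ")
    End Function

    Sub Main()
        Dim html As String = "<p>some   text</p><img src=""/images/my  img.jpg"">"
        ' Prints: <p>some text</p><img src="/images/my  img.jpg">
        Console.WriteLine(StripDoubleSpaces(html))
    End Sub
End Module
```

Unquoted attribute values are not handled by this pattern, but they cannot contain spaces in the first place.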

Niels