5

A client reported a problem where our software was "hanging". I eventually managed to find the piece of code responsible:

While strShortenedText.IndexOf("  ") > -1 'if there are two contiguous spaces
                                          'anywhere in this string ...
    strShortenedText = strShortenedText.Replace("  ", " ") '... replace each
                                                           'occurrence with a
                                                           'single string ...
End While '... and do this in a loop in order to handle situations with 3 or more
          'contiguous spaces

I'm afraid I won't be able to include the actual string here as it contains data covered by GDPR legislation, but I am just looking for an explanation as to how this could possibly have gone into an infinite loop?

What I find is that:

  1. strShortenedText.IndexOf(" ") returns a value greater than -1, so somewhere in this string there are two contiguous spaces (I made sure that these spaces are both 0x20 i.e. char(32))
  2. So if strShortenedText contains " " (2 contiguous spaces) anywhere, and I replace " " (2 contiguous spaces) with " " (a single space) then, surely (??????), strShortenedText after the .Replace will be shorter than it was before the replace?

Instead, I find that the REPLACE leaves the string unchanged, and no matter how often the code loops through the statement, the variable's length never changes.

How might this be possible? And what could I possibly do to handle this scenario?

Olivier Jacot-Descombes
  • 104,806
  • 13
  • 138
  • 188
DinahMoeHumm
  • 185
  • 10
  • 2
    You are correct. Surely it works. So the only way it doesn't is that your assumptions are wrong. Are you sure you know what string encoding you're using? – Enigmativity Jul 11 '23 at 13:46
  • It looks right to me, too. Maybe replace `" "` with a variable to ensure the thing found is the same? – Michael Foster Jul 11 '23 at 13:55
  • I would have thought that regardless of string encoding, the VB.NET method "IndexOf" should be using the same encoding as the "Replace" method? I have no doubt that at least ONE of the assumptions on which this code is based is incorrect but at this point in time I have no idea what that assumption might even be :-) – DinahMoeHumm Jul 11 '23 at 13:58
  • ok folks, the plot thickens..... When I step through the code, strShortenedText.IndexOf(" ") returns 6830 ASC(strShortenedText.Substring(6830, 1)), as expected, returns 32 however ASC(strShortenedText.Substring(6831, 1)) returns 173? and ASC(strShortenedText.Substring(6832, 1)) returns 32 again It looks like the IndexOf call finds position 6830, but it somehow doesn't "see" the character with ascii code 173 that is sitting in position 6831 – DinahMoeHumm Jul 11 '23 at 14:04
  • 2
    And, oddly enough, string encoding seems to play a part here.... – Enigmativity Jul 11 '23 at 14:05
  • 1
    I'd look at your byte array of the string you are dealing with and post those in your question. I suspect it'll become quite clear what's going on when you do. – Enigmativity Jul 11 '23 at 14:06
  • Apparently ascii character 173 is a non printing character called a "soft hyphen"... so now I just need to figure out what to do with that. – DinahMoeHumm Jul 11 '23 at 14:07
  • Okay from https://stackoverflow.com/questions/1025332/determine-a-strings-encoding-in-c-sharp "A string doesn't really have encoding... in .NET the a string is a collection of char objects. Essentially, if it is a string, it has already been decoded." - which does sound reasonable to me. As for its origin... the string is coming from a VARCHAR column in a table in SQL Server (Azure SQL, with Collation SQL_Latin1_General_CP1_CI_AS if that helps?) – DinahMoeHumm Jul 11 '23 at 14:13
  • 1
    FYI, I have a hand-drawn C# version of this code in https://codeblog.jonskeet.uk/2009/11/02/omg-ponies-aka-humanity-epic-fail describing exactly this issue... – Jon Skeet Jul 11 '23 at 14:37
  • 1
    strShortenedText.Replace(" ", " ", StringComparison.Ordinal) – Hans Passant Jul 11 '23 at 14:38
  • Ooh.... Let me check that one out Hans Passant, I'll let you know what I find – DinahMoeHumm Jul 11 '23 at 14:42
  • Hi Hans Passant, This doesn't appear to work in VB.NET ? The only two overloads for the function I can see have a old and a new value, either oldChar and newChar, or oldValue and newValue. There is no overload that accepts a StringComparison parameter. – DinahMoeHumm Jul 11 '23 at 14:47
  • 1
    @HansPassant: I suspect the `IndexOf` needs to use `StringComparison.Ordinal` as well. Basically the two need to be consistent. – Jon Skeet Jul 11 '23 at 14:55
  • I would use the [String.Contains(String, StringComparison)](https://learn.microsoft.com/en-us/dotnet/api/system.string.contains?view=net-7.0#system-string-contains(system-string-system-stringcomparison)) with `StringComparison.Ordinal`. – Olivier Jacot-Descombes Jul 11 '23 at 15:27
  • Hello all - with regard to StringComparison.Ordinal solutions I'll just point out that that overload is not available in a project targeting .NET Framework 4.8 - if you come across this issue and you're lucky enough that your project targets .NET 7 this may be a better solution than mine. Thanks to all who suggested it. – DinahMoeHumm Jul 12 '23 at 12:58

6 Answers6

5

The problem is that IndexOf and Replace use different ways of finding text by default. That means that if you have a string containing U+200C (zero-width non-joiner) for example, IndexOf will skip over it and find the two spaces each side, but Replace won't.

You need to make IndexOf and Replace use a consistent way of looking for text. The simplest to understand (but potentially not the most desirable) option is to use an ordinal check:

Module Program
    Sub Main(args As String())
        Console.WriteLine(ReplaceMultipleSpaces("x " & ChrW(&H200C) & " y"))
    End Sub

    Function ReplaceMultipleSpaces(text As String)
        While text.IndexOf("  ", StringComparison.Ordinal) > -1
            text = text.Replace("  ", " ", StringComparison.Ordinal)
        End While
        Return text
    End Function
End Module

(Using String.Contains instead of String.IndexOf is probably preferable in general - but comes with the same requirement to specify the kind of check you want to perform, if you don't want to just use the default.)

Jon Skeet
  • 1,421,763
  • 867
  • 9,128
  • 9,194
  • Hello Jon Skeet.... I'm not sure why this is but the .Replace function I see on strings does not have a StringComparison parameter in any of its overloads? – DinahMoeHumm Jul 12 '23 at 12:04
  • Ah .... Mystery solved. The ComparisonType overload is available in .NET 7, but our software is a few years old and it targets .NET Framework 4.8 at this point in time, and that overload is not available in that framework. – DinahMoeHumm Jul 12 '23 at 12:56
2

... what could I possibly do to handle this scenario?

Exit the loop once you didn't make any replacements

Dim lastLength = strShortenedText.Length
While strShortenedText.IndexOf("  ") > -1 
    strShortenedText = strShortenedText.Replace("  ", " ")
    If lastLength = strShortenedText.Length Then Exit While
    lastLength = strShortenedText.Length
End While

Or switch to Do...While,

Dim lastLength As Integer
Do
    lastLength = strShortenedText.Length
    strShortenedText = strShortenedText.Replace("  ", " ")
Loop While lastLength > strShortenedText.Length

If you want to actually handle the special character, this may work for you. Use a function to combine all your chars into strings which you iterate over to perform replacements. It will also allow you to add additional characters to be replaced. The combinations are in pairs, and all possible combinations, so it may be a little inefficient i.e. AA, AB, BA, BB

Private Iterator Function getPairsOfChars(chars As IEnumerable(Of Char)) As IEnumerable(Of String)
    For i = 0 To chars.Count() - 1
        For j = 0 To chars.Count() - 1
            Yield New String({chars(i), chars(j)})
        Next
    Next
End Function

Public Sub test()
    Dim strShortenedText = $"  {ChrW(173)}  s   s   s  {ChrW(173)}{ChrW(173)}    s {ChrW(173)}{ChrW(173)}{ChrW(173)}    s {ChrW(173)}s    s{ChrW(173)}s "
    Dim lastLength As Integer
    Dim pairs = getPairsOfChars({ChrW(32), ChrW(173)})
    Do
        lastLength = strShortenedText.Length
        For Each pair In pairs
            strShortenedText = strShortenedText.Replace(pair, " ")
        Next
    Loop While lastLength > strShortenedText.Length
End Sub
djv
  • 15,168
  • 7
  • 48
  • 72
  • Yes that was my original solution but it left data in the string that I wasn't happy with so I felt I needed to look at a better solution. All the same, though, I +1'ed (is that the correct terminology to use here?) your answer as it may be more good enough in some situations. – DinahMoeHumm Jul 11 '23 at 14:38
  • 1
    @DinahMoeHumm Seeing your answer, I realize you actually want to continue removing those chars, so this is not ideal. Glad you found a workaround! – djv Jul 11 '23 at 14:38
  • @DinahMoeHumm I added a solution which may do what you need – djv Jul 11 '23 at 14:53
  • Thanks @djv - That one may work very well for someone who is looking for very specific sets of characters so hopefully if my own solution doesn't work for such a person, yours may work for them! – DinahMoeHumm Jul 11 '23 at 14:58
2

As mentioned in the Comments, the cause of the problem was a non-printable character (ASCII code 173) between the two spaces. Apparently the .IndexOf method does NOT see this character, whereas the .Replace method does. As a result, the .IndexOf method finds a position where it thinks there are two contiguous spaces, but the .Replace method then cannot replace that as the actual text in that location is not two ASCII characters 32, but ASCII character 32, followed by ASCII character 173, followed by a second ASCII character 32.

The solution then is to code around that possibility:

    nDoubleSpace = strShortenedText.IndexOf("  ") 'Okay - we found a situation where there may be non-printing characters
    '                                              between two spaces, where this IndexOf method returns a value > -1 ....
    '                                              And this caused an infinite loop as the Replace did nothing.
    While nDoubleSpace > -1
        If strShortenedText.Substring(nDoubleSpace, 2) <> "  " Then
            'The IndexOf found a location with two spaces separated by non-printing characters (for example, character 173),
            'which means a .Replace won't work. Sadly this software targets .NET Framework, not .NET Core so we don't have
            'the option to use a StringComparison override in the .Replace. This is the next best thing we can do...
            'Basically, we know that the location we found with the .IndexOf STARTS and ENDS with a space. So instead of
            'doing a .Replace, we just concatenate everything before the FIRST space with everything
            'from the SECOND space onward.
            Dim nNextSpace As Integer
            nNextSpace = 1
            'Find the position of the second space
            While strShortenedText.Substring(nDoubleSpace + nNextSpace, 1) <> " "
                nNextSpace = nNextSpace + 1
            End While
            '                  everything before the first space             everything from the second space onward
            strShortenedText = strShortenedText.Substring(0, nDoubleSpace) + strShortenedText.Substring(nDoubleSpace + nNextSpace)
        Else
            'Now we know that the IndexOf and the Substring are in agreement, we can be happy that this
            'simple Replace will do the job
            strShortenedText = strShortenedText.Replace("  ", " ")
        End If

        'And check again to handle more than 2 contiguous spaces
        nDoubleSpace = strShortenedText.IndexOf("  ")
    End While

Yeah it's ugly. But it works

DinahMoeHumm
  • 185
  • 10
  • 2
    "ASCII code 173" isn't really a thing. ASCII is a 7-bit encoding - it goes as far as 127 (or 126, depending on your definition) but not up to 173. My guess is that the Unicode character you encounted was probably U+200C, a zero-width non-joiner. – Jon Skeet Jul 11 '23 at 14:38
  • Yes indeed, so I tried to come up with a solution that doesn't rely on testing for specific values, but rather anything that will cause .IndexOf to find a position while .Substring and .Replace can't find those values. – DinahMoeHumm Jul 11 '23 at 14:41
  • I will accept this as "the" answer but only because it is the solution that happened to work best for me. Other suggested answers may actually work better for other people facing a similar issue in other situations. For example people working against .NET 7 might use one of the other solutions that use StringComparison.Ordinal in a .Replace as suggested in some of the other answers. I cannot use that suggestion as the option is not available in a .NET Framework 4.8 solution and that's what I happen to be working in. – DinahMoeHumm Jul 12 '23 at 15:10
  • 1
    It's worth noting that .NET 7 isn't required - the overloads were introduced in .NET Core 2.0. So it's really just "not .NET Framework" at this point... – Jon Skeet Jul 12 '23 at 15:13
  • Cheers.... Pity they didn't see fit to retrofit it to .NET Framework, so! :-) – DinahMoeHumm Jul 13 '23 at 08:28
  • Modified the answer so it can handle multiple non-printing characters between the two spaces – DinahMoeHumm Jul 14 '23 at 12:58
  • @DinahMoeHumm - you marked your answer as the solution? Really? – dbasnett Jul 14 '23 at 14:14
  • Yes. it's the one that worked for me. I +1ed the ones that may work for somebody else with similar problems but, for example, who is coding against .NET Core. – DinahMoeHumm Jul 14 '23 at 16:53
2

Maybe some Regex?

    Dim strShortenedText As String = "More     " & Chr(173) & "  or less"
    'before the or(|) there is a soft hyphen
    Dim regx As New System.Text.RegularExpressions.Regex("(­|\s)\s+", System.Text.RegularExpressions.RegexOptions.Compiled)
    strShortenedText = regx.Replace(strShortenedText, " ")
dbasnett
  • 11,334
  • 2
  • 25
  • 33
  • 1
    I hope this Regex solution is useful to someone; I personally cannot abide Regex so I won't be using it, but that doesn't mean the quality of your answer is in dispute :-) – DinahMoeHumm Jul 12 '23 at 15:05
  • 1
    @DinahMoeHumm - I feel very strongly both ways. ;) – dbasnett Jul 12 '23 at 15:55
1

If there are other non-standard 'space' characters this approach seems the easiest to maintain.

    ' https://unicode-explorer.com/articles/space-characters
    ' 
    Dim spcChars() As Char = {" "c, ControlChars.Tab, ChrW(173), ChrW(&H2000), ChrW(8192), ChrW(8193), ChrW(8194), ChrW(8195)}

    Dim strShortenedText As String = "More     " & Chr(173) & Chr(173) & ControlChars.Tab & "  or less"
    Dim splt() As String = strShortenedText.Split(spcChars, StringSplitOptions.RemoveEmptyEntries)
    strShortenedText = String.Join(" "c, splt)
dbasnett
  • 11,334
  • 2
  • 25
  • 33
  • Oh yuck yes .... my own solution isn't that great if there are two spaces separated by more than one such "funky" character, but at this point I'm just going to go ahead and hope that that scenario won't arise. *grabs brush and sweeps it under the carpet* – DinahMoeHumm Jul 13 '23 at 08:30
  • In all fairness I'm sure it'll be fine. This situation has now arisen ONCE with hundreds of clients some of which have several million pieces of text like this scraped out of as many documents.... – DinahMoeHumm Jul 13 '23 at 08:35
  • ... but I can't just leave it so I modified my answer to make sure it handles that situation, too – DinahMoeHumm Jul 14 '23 at 12:54
0

I can't test this, since I don't have your string, but if you're looking for a double-space and find one, then Replace can't find it, how about just shortening the string?

Dim spaceIndex As Integer = strShortenedText.IndexOf("  ")
While spaceIndex > -1 'if there are two contiguous spaces anywhere in this string ...
    strShortenedText = Left(strShortenedText, spaceIndex + 1 + Mid(strShortenedText, spaceIndex + 2) '... remove the first space found
    ' Note that IndexOf is zero-based and Left and Mid are 1-based
    spaceIndex = strShortenedText.IndexOf("  ")
End While

I'm not sure what this will do with your 173 character, but I think it's worth a try, since it is a lot shorter if it works.

Michael Foster
  • 420
  • 6
  • 12